Byoung-Tak  Zhang 
Mehmet  A.  Orgun  (Eds.) 


PRICAI 2010: 

Trends  in 

Artificial  Intelligence 


11th  Pacific  Rim  International  Conference  on  Artificial  Intelligence 

Daegu,  Korea,  August/September  2010 

Proceedings 


REPORT  DOCUMENTATION  PAGE 

Form  Approved 

OMB  No.  0704-0188 

The public reporting burden for this collection of information is estimated to average 1 hour per response,
including the time for reviewing instructions, searching existing data sources, gathering and maintaining the
data needed, and completing and reviewing the collection of information. Send comments regarding this burden
estimate or any other aspect of this collection of information, including suggestions for reducing the burden,
to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and
Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be
aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing
to comply with a collection of information if it does not display a currently valid OMB control number.

PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.

1. REPORT DATE (DD-MM-YYYY): 08-09-2010
2. REPORT TYPE: Conference Proceedings
3. DATES COVERED (From - To): 30-Aug-10 - 02-Sep-10
4. TITLE AND SUBTITLE: PRICAI 2010: The 11th Pacific Rim International Conference on Artificial Intelligence
5a. CONTRACT NUMBER:
5b. GRANT NUMBER: FA23861011038
5c. PROGRAM ELEMENT NUMBER:
5d. PROJECT NUMBER:
5e. TASK NUMBER:
5f. WORK UNIT NUMBER:
6. AUTHOR(S): Byoung-Tak Zhang and Mehmet A. Orgun (Eds.)
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES):
   School of Computer Science and Engineering, Kyungpook National University
   Sankyunk-Dong 1370, Buk-Gu
   Daegu 702-701
   Korea
8. PERFORMING ORGANIZATION REPORT NUMBER: N/A
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES):
   AOARD
   UNIT 45002
   APO AP 96338-5002
10. SPONSOR/MONITOR'S ACRONYM(S): AOARD
11. SPONSOR/MONITOR'S REPORT NUMBER(S): CSP-101038
12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release, U.S. Government purpose rights

13. SUPPLEMENTARY NOTES

Springer ©2010, Springer-Verlag, Berlin. The U.S. Government has non-exclusive license rights to use, modify,
reproduce, release, perform, display, or disclose these materials, and to authorize others to do so for US Government
purposes only. All other rights reserved by the copyright holder.

14. ABSTRACT

PRICAI is a biennial Pacific Rim conference on artificial intelligence. There were 69 papers accepted, of
which 48 were presented orally and 21 were presented as posters. This volume contains these 69 papers plus
summaries of 1 keynote speech and 3 invited talks. The topics covered include AI foundations, Applications
of AI, Agents, Bioinformatics, Cognitive modeling and human interaction, Computer-aided education, Constraint
satisfaction, Creativity support, Decision theory, Evolutionary computation, Game playing and interactive
entertainment, Heuristics, Information integration and extraction, Information retrieval and extraction,
Knowledge acquisition and ontology, Knowledge representation, Machine learning and data mining, Model-based
systems, Multimedia and AI, Natural language processing, Planning and scheduling, Reasoning, Robotics,
Text/Web data mining, Social intelligence, Speech processing, Uncertainty, and Vision and perception.

15. SUBJECT TERMS

Artificial Intelligence, Machine Learning, Data Mining, Natural Language Processing, Agent Based Modeling


16. SECURITY CLASSIFICATION OF:
    a. REPORT: U
    b. ABSTRACT: U
    c. THIS PAGE: U
17. LIMITATION OF ABSTRACT: UU
18. NUMBER OF PAGES: 715
19a. NAME OF RESPONSIBLE PERSON: Hiroshi Motoda, Ph.D.
19b. TELEPHONE NUMBER (Include area code): +81-3-5410-4409

Standard  Form  298  (Rev.  8/98) 

Prescribed by ANSI Std. Z39.18


Lecture  Notes  in  Artificial  Intelligence 

Edited  by  R.  Goebel,  J.  Siekmann,  and  W.  Wahlster 
Subseries of Lecture Notes in Computer Science


6230 


Byoung-Tak  Zhang  Mehmet  A.  Orgun  (Eds.) 


PRICAI  2010: 

Trends  in 

Artificial  Intelligence 


11th Pacific Rim International Conference

on  Artificial  Intelligence 

Daegu,  Korea,  August  30  -  September  2,  2010 

Proceedings 


20101130209 


Springer


Series Editors

Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume  Editors 
Byoung-Tak  Zhang 

School  of  Computer  Science  and  Engineering 
Seoul  National  University 
Seoul,  Korea 

E-mail:  btzhang@bi.snu.ac.kr 

Mehmet  A.  Orgun 
Department  of  Computing 
Macquarie  University 
Sydney,  NSW,  Australia 
E-mail: mehmet.orgun@mq.edu.au


Library  of  Congress  Control  Number  2010932614 


CR Subject Classification (1998): I.2, H.3, H.4, F.1, H.2.8, J.3
LNCS Sublibrary: SL 7 - Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-642-15245-7 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-15245-0 Springer Berlin Heidelberg New York


This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180


Preface 


This volume contains the papers presented at the 11th Pacific Rim International
Conference on Artificial Intelligence (PRICAI 2010), held during August 30 -
September 2, 2010 in Daegu, one of the most dynamic urban cities in Korea
with a rich traditional cultural heritage.

PRICAI is a biennial conference inaugurated in Tokyo in 1990 to promote
collaborative exploitation of artificial intelligence (AI) in the Pacific Rim nations.
Over the past 20 years, the conference has grown, both in participation and
scope, to be a premier international AI event for all major Pacific Rim nations as
well as the countries from all around the world, highlighting the most significant
contributions to the field of AI. This year, PRICAI 2010 also featured several
special sessions on the emerging multi-disciplinary research areas ranging from
Evolving Autonomous Systems to Human-Augmented Cognition.

There was overwhelming interest in the call for papers for the conference.
As a result, PRICAI 2010 attracted 191 full-paper submissions to the
regular session and the special sessions of the conference from researchers from
many regions of the world. Each submitted paper was carefully considered by
a combination of several independent reviewers, Program Committee members,
Associate Chairs, Program Vice Chairs and Program Chairs, and finalized in a
highly selective process that balanced many aspects of the paper, including the
significance, originality, technical quality and clarity of the contributions, and
its relevance to the conference topics. As a result, this volume reproduces 48 papers
that were accepted as regular papers (including the special session papers)
and 21 papers that were accepted as short papers. This gives a regular paper
acceptance rate of 25.13%, and a short paper acceptance rate of 10.99%, with
an overall paper acceptance rate of 36.12%.

The regular papers were presented over three days in the topical program sessions
and special sessions during August 31 - September 2. The short papers were
presented in an interactive poster session, as well as in a plenary session,
contributing to a stimulating conference for all the participants. The PRICAI 2010
program also featured the 11th International Workshop on Knowledge Management
and Acquisition for Smart Systems and Services (PKAW 2010) chaired by
Paul Compton (University of New South Wales, Australia) and Hiroshi Motoda
(Osaka University, Japan). The PKAW series has been an integral part of the
PRICAI program over the past 11 years and this year was no exception.

We were also honored to have keynote presentations by four distinguished
researchers in the field of AI whose contributions have crossed discipline boundaries.
Heinrich Bülthoff from Max Planck Institute for Biological Cybernetics,
Germany, talked on Towards Artificial Systems: What Can We Learn from
Human Perception?; Mitsuru Ishizuka from University of Tokyo, Japan,
on Exploiting Macro and Micro Relations Toward Web Intelligence; Mike
Schuster from Google, USA, on Speech Recognition for Mobile Devices at Google;
and Toby Walsh from NICTA, Australia, on Symmetry Within and Between
Solutions. We were grateful to them for sharing their insights on their latest
research with us.

The PRICAI 2010 program was the culmination of efforts expended so willingly
by numerous people from all over the world over the past year. We would like
to thank all the Program Vice Chairs and the Associate Chairs for their extremely
hard work in the review process and the Program Committee members and the
reviewers for a timely return of their comprehensive reviews of the submitted
papers. Without their help and expert opinions, it would have been impossible
to make decisions on each submitted paper and produce such a high-quality program.
We would like to acknowledge the contributions of all the authors of the
191 submissions who made the program possible in the first place.

We would like to thank the Conference General Chairs, Jin-Hyung Kim
(KAIST, Korea) and Abdul Sattar (Griffith University, Australia) for their
continued support and guidance, and the Organizing Chairs Seong-Bae Park
(Kyungpook National University, Korea) and Cheol-Young Ock (University of
Ulsan, Korea) for making sure that the conference ran smoothly. Thanks are
also due to:

- Special Sessions Chairs: Bob McKay, Minho Lee and Michael Strube
- Tutorials Chairs: Zhi-Hua Zhou and Kee-Eung Kim
- Workshops Chairs: Aditya Ghose and Shusaku Tsumoto
- Posters Chairs: Sanjay Chawla and Kyu-Baek Hwang
- Publications Chair: Byeong-Ho Kang
- Treasury Chair: Bo-Yeong Kang
- Publicity Chairs: Jung-Jin Yang, Takayuki Ito, Zhi Jin and Paul Scerri

Microsoft’s  CMT  conference  management  system  was  used  in  all  stages  of 
the  paper  submission  and  review  process  and  also  in  the  collection  of  the  final 
camera-ready  papers;  it  made  our  life  much  easier. 

We also greatly appreciated the financial support from Air Force Office of
the Scientific Research/Asian Office of Aerospace Research and Development
(AFOSR/AOARD), Office of Naval Research Global (ONRG), National Research
Foundation of Korea, ETRI, LG CNS, KT, Soongsil University, Soft on
Net, Saltlux, CRH for Human, Cognition and Environment, Daegu Convention
& Visitors Bureau, and Korea Tourism Organization.

Special thanks go to Min Su Lee (Seoul National University, Korea) for supporting
the committees so effectively; her dedication and resourcefulness made
all the difference at several critical junctions of the whole process.

It has been a great pleasure for us to serve as the Program Chairs of PRICAI
2010 and to present a high-quality scientific program for the benefit of the
participants of the conference as well as the readers of this proceedings volume.

Abstract. Research in learning algorithms and sensor hardware has led to rapid
advances in artificial systems over the past decade. However, their performance
continues to fall short of the efficiency and versatility of human behavior. In
many ways, a deeper understanding of how human perceptual systems process
and act upon physical sensory information can contribute to the development of
better artificial systems. In the presented research, we highlight how the latest
tools in computer vision, computer graphics, and virtual reality technology can
be used to systematically understand the factors that determine how humans
perform in realistic scenarios of complex task-solving.

Keywords: perception, object recognition, face recognition, eye-movement,
human-machine interfaces, virtual reality, biological cybernetics.


The methods by which we process sensory information and act upon it comprise a
versatile control system. We are capable of carrying out a multitude of complex
operations, in spite of obvious limitations in our biological "hardware". These
capabilities include our ability to expertly learn and identify objects and people by
effectively navigating our eyes and body movements in our visual environment. This talk will
present the research perspective of the Biological Cybernetics labs at the Max Planck
Institute, Tübingen and the Department of Brain and Cognitive Engineering, Korea
University. Key examples will be drawn from our research on face recognition, the
relevance of dynamic information, and active vision, in order to convey how perceptual
research can contribute towards the development of better artificial systems.

To begin, our prodigious ability to learn and remember recently encountered faces
- even from only a few instances - reflects a multi-purpose pattern recognition system
that few artificial systems can rival, even with the availability of 3D range data.
Unintuitively, this perceptual expertise relies on fewer, rather than more, facial features
than state-of-the-art face-recognition algorithms typically process. Our visual field of
high acuity is extremely limited (~2°) and experimental studies indicate that we have
an obvious preference for selectively fixating the eyes and noses of faces that we
inspect [1]. These facial features inhabit a narrow bandwidth of spatial frequencies
(8 to 16 cycles per face) that face-processing competencies are specialized for [2].
Therefore, perceptual expertise appears to result from featural selectivity, wherein
sparse coding by a dedicated system results in expert discrimination. The application
of the same principles in artificial systems holds the promise of improving automatic
recognition performance.

Self-motion as well as moving objects in our environment dictate that we have to
deal with a visual input that is constantly changing. Automated recognition systems
would often consider this variability to be a computational hindrance that disrupts the
stable retrieval of recognizable object features. Nonetheless, human recognition
performance on objects [3] and faces [4] is better served by moving rather than static
stimuli. Understanding why this is so could allow artificial recognition systems to
function equally well in dynamic environments. First, dynamic presentations present
the opportunity for associative learning between familiar object views, which could
result in object representations that are robust to variations in pose [5, 6].
Furthermore, dynamic presentations could allow the perceptual system to assess the stability
of different object features, according to how they tend to appear and disappear over
rigid rotations. This could offer a computationally cheap method for determining the
minimal set of object views that would be sufficient for robust recognition [7, 8].
Finally, characteristic motion properties (e.g., trajectories, velocity profile) could even
serve as an additional class of features to complement a traditional reliance on image
and shape features by automated recognition systems [9, 10].

Purposeful gaze behavior indicates a perceptual system that is not only capable of
processing information, but proficient in seeking out information, too. We are capable
of extracting a scene's gist within the first few hundred milliseconds of encountering
it [11]. In turn, this information directs movement of our eyes and head for the joint
purpose of fixating information-rich regions across a large field of view [12]. In
addition, we use our hands to explore and manipulate objects so as to access task-relevant
information for object learning or recognition [13, 14, 15]. Careful observations of
how we interact with our environments can identify behavioral primitives that could
be modeled and incorporated into artificial systems as functional (and re-usable)
components [16]. Furthermore, understanding how eye and body movements
naturally coordinate can allow us to improve the usability of artificial systems [17].

This perspective of the perceptual system as an active control system continues to
be insightful at a higher level, when we consider the human operator as a controller
component in dynamic machine systems. Take, for example, a pilot who has to
simultaneously process visual and vestibular information in order to control helicopter
stability. Using motion platforms and immersive graphics, it is possible to systematically
identify the input parameters that are directly relevant to a pilot's task performance and
thus derive a functional relationship between perceptual inputs and performance
output [18]. Such research is fundamental for the development of virtual environments
that are perceptually realistic. This is especially important when designing artificial
systems (e.g., flight simulators) that are intended to prepare novices for physically
dangerous situations that are not easily replicable in the real world [19].

Until now, we have discussed how findings from perceptual research can contribute
towards improving artificial systems. However, the growing prevalence of these
systems in our daily environs raises an imperative to go beyond this goal. It is crucial
to consider how perceptual and artificial systems may be integrated into a coherent
whole by considering the "human-in-the-loop". Doing so will lead towards a new
generation of autonomous systems that will not merely mimic our perceptual
competencies, but will be able to cooperate with and augment our natural capabilities.

Relations are basic elements for representing knowledge, such as in semantic
networks, logic and others. In Web intelligence research, the extraction or mining of
meaningful knowledge and the utilization of that knowledge for intelligent services are
key issues. In this talk, I will present some of our research related to these issues,
ranging from macro relations to micro ones. Here we mostly use Web texts, and the
use of their huge data through a search engine becomes a key function together with
text analysis.

The first topic concerns the extraction of human-human and company-company
relations from the Web [1-14]. Relation types between two entities are also
extracted here. An open Web service based on this function has been operated in
Japan by a company. One technology related to this one is namesake disambiguation
[15-17].

Unlike much other Web information, Wikipedia is a reliable source of broad
knowledge. In order to extract the knowledge or data from Wikipedia in a form that
computers can understand and manipulate, several attempts including ours [18-23]
have been carried out, typically to extract triplets such as (entity, attribute, value).

After we worked on computing similarity between two words based on the
distributional hypothesis [24, 25], we have been interested in computing similarity
between two word pairs (or two entity pairs) [26-28]. As in the previous case, we
are mainly utilizing the distributional hypothesis, and have invented an efficient
clustering method for dealing with several tens of thousands of lexical patterns.
Based on this mechanism, we have implemented a latent relational search engine,
which accepts two entity pairs with one missing component such as {(Tokyo,
Japan), (?, France)} as a query, and produces an answer such as (? = Paris) with its
evidence. As an extension of this mechanism, we recently invented an efficient
co-clustering method, which works well to find arbitrary existing relations between
two nouns in sentences [29]. This problem setting is called open information
extraction (open IE).
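
The latent relational search query above can be illustrated with a toy sketch. It assumes that each entity pair is represented by counts of the lexical patterns that connect its two entities in text, and that a query such as {(Tokyo, Japan), (?, France)} is answered by the candidate pair whose pattern vector is most similar under cosine similarity. The pattern counts, the candidate pairs, and the function names are hypothetical and serve only as illustration; the pattern clustering mentioned in the talk is not reproduced here.

```python
import math

# Hypothetical counts of lexical patterns observed between each entity pair (X, Y).
PATTERNS = {
    ("Tokyo", "Japan"): {"X is the capital of Y": 9, "X , Y": 4,
                         "X is the largest city in Y": 5},
    ("Paris", "France"): {"X is the capital of Y": 8, "X , Y": 5,
                          "X is the largest city in Y": 4},
    ("Lyon", "France"): {"X , Y": 6, "X is a city in Y": 7},
    ("Euro", "France"): {"X is the currency of Y": 9},
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * v.get(pattern, 0) for pattern, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def latent_relational_search(source_pair, target_second):
    """Answer {source_pair, (?, target_second)} with the best-matching first entity."""
    source_vec = PATTERNS[source_pair]
    scored = [
        (cosine(source_vec, vec), first)
        for (first, second), vec in PATTERNS.items()
        if second == target_second and (first, second) != source_pair
    ]
    return max(scored)[1] if scored else None


answer = latent_relational_search(("Tokyo", "Japan"), "France")
```

With these toy counts, the pair (Paris, France) shares the "capital of" patterns with (Tokyo, Japan) and therefore scores highest; the shared patterns play the role of the evidence returned with the answer.
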

The final topic of the talk is Concept Description Language (CDL), which has been
designed to serve as a common language for representing concept meaning expressed
in natural language texts [30-32]. Unlike the Semantic Web, which provides
machine-readable meta-data in the form of RDF, CDL aims to encode the meaning of
whole texts in a machine-understandable form. The basic representation elements in
CDL are micro relations existing between entities in the text; 44 relation types are
defined. CDL has been discussed in a W3C incubator group for international
standardization since 2007. It is intended to be a basis of semantic computing in the next
generation, and also to become a medium for overcoming language barriers in the world.
Current issues of CDL are, among others, an easy semi-automatic way of converting
natural language texts into the CDL description, and an effective mechanism of
semantic retrieval on the CDL database.

Abstract. We briefly describe here some of the content of a talk to be
given at the conference.


1  Introduction 

At Google, we focus on making information universally accessible through many
channels, including through spoken input. Since the speech group started in
2005 we have developed several successful speech recognition services for the
US and for some other countries. In 2006 we launched GOOG-411 in the US,
a speech recognition driven directory assistance service which works from any
phone. As smartphones like the iPhone, BlackBerry, Nokia s60 platform and
phones running the Android operating system like the Nexus One and others
became more widely used, we shifted our efforts to provide speech input for
the search engine (Search by Voice) and other applications on these phones.
Many recent smartphones have only soft keyboards which can be difficult to
type on, especially for longer input words and sentences. Some Asian languages,
for example Japanese and Chinese, are more difficult to type as the basic number
of characters is very high compared to Latin alphabet languages. Spoken input
is a natural choice to improve on many of these problems, and more details are
discussed in the sections below.

We have also been working on voice mail transcription and YouTube transcription
for US English, which are also publicly available products in the US,
but the focus here will be on speech recognition in the context of mobile devices.


2  GOOG-411 

GOOG-411 is Google's speech recognition based directory assistance service
operating in the US and Canada [1], [2]. This application uses a toll-free number,
1-800-GOOG-411 (1-800-4664-411). The user is prompted to say city, state and
the name of the business s(he) is looking for. Using text-to-speech the service can
give address and phone number, or can connect the user directly to the business.
As backend, information from Google Maps Local is used.

While this is a useful application to search for restaurants, stores etc., it is
limited to businesses. Other difficulties with this kind of service include the
necessity of a dialog, relatively expensive operating costs, listing errors in the
backend database, and most importantly the inability to give richer information
(as on a smartphone screen) back to the user.

3  Voice  Search 

In 2008 Google launched Voice Search in the US for several types of smartphones
[3]. Voice Search simply adds the ability to speak a search query to the phone
instead of having to type it into the browser. The audio is sent to Google servers
where it is recognized, and the recognition result along with the search result
is sent back to the phone. The data goes over the data channel instead of the
voice channel, which allows higher quality audio transmission and therefore better
recognition rates. Our speech recognition technology is relatively standard; some
details are given below.

Front-End and Acoustic Model. For the front-end we use 39-dimensional
PLP features with LDA. The acoustic models are ML and MMI trained, triphone
decision-tree tied 3-state HMMs with currently up to 10k states total. The state
distributions are modeled by 50-300k diagonal covariance Gaussians with STC.
We use a time-synchronous finite-state transducer (FST) decoder with Gaussian
selection for speedy likelihood calculation.

Dictionary. Our phone set contains between 30 and 100 phones depending on
the language. We use between 200k and 1.5M words in the dictionary, which
are automatically extracted from the web-based query stream. The pronunciations
for these words are mostly generated by an automatic system with special
treatment for numbers, abbreviations and other exceptions.

Language Model. As our goal is to recognize search queries we mine our
language model data from web-based anonymous search queries. We mostly use
3-grams or 5-grams with Katz backoff trained on months or years of query data.
The language models have to be pruned appropriately such that the final decoder
graphs fit into memory of the servers.

Acoustic Data. To train an initial system we collect roughly 250k spoken
queries using an Android application specifically designed for this purpose [4].
Several hundred speakers read queries off a screen and the corresponding voice
samples are recorded. As most queries are spoken without errors we don't have
to manually transcribe these queries.

Metrics. We want to optimize user experience. Traditionally speech recognition
systems focus on minimizing word error rate. This is also a useful measure for us,
but better is a normalized sentence error rate as it doesn't depend as much on
the definition of a word. As the metric which approximates user experience best
we use WebScore: we send hypothesis and reference to a search backend and
compare the links we get back. Assuming that the reference generates the correct
search result, this way we know whether the search result for the hypothesis is
within the first three results, such that the user can see the correct result on
his smartphone screen.
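
The three metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Google's implementation: the sentence error rate is normalized here by lowercasing and removing whitespace (one possible way to make it independent of word boundaries), and the WebScore reading used is "the top link for the reference appears among the first three links for the hypothesis". The `search` argument is a hypothetical callable that returns a ranked list of links for a query string.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """Total word edits divided by total reference words, over (ref, hyp) pairs."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def normalize(sentence):
    """Assumed normalization: lowercase and drop all whitespace."""
    return "".join(sentence.lower().split())


def sentence_error_rate(pairs):
    """Fraction of sentences whose normalized hypothesis differs from the reference."""
    return sum(normalize(ref) != normalize(hyp) for ref, hyp in pairs) / len(pairs)


def web_score(pairs, search, top_n=3):
    """Fraction of queries whose reference top link is in the hypothesis top_n links."""
    hits = 0
    for ref, hyp in pairs:
        ref_links = search(ref)
        if ref_links and ref_links[0] in search(hyp)[:top_n]:
            hits += 1
    return hits / len(pairs)
```

A hypothesis such as "whether seoul" for the reference "weather seoul" counts as a word error and a sentence error, yet it can still score as a WebScore hit if the search backend returns the same top link for both, which is why WebScore tracks user experience more closely.
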

Languages. After US English we launched Voice Search for the UK, Australia
and India. In late 2009 Mandarin Chinese [5] and Japanese were added. Foreign
languages pose many additional challenges. For example, some Asian languages
like Japanese and Chinese don't have spaces between words. For these we wrote a
segmenter which optimizes the word definitions, maximizing sentence likelihood.
Most languages have characters outside the normal ASCII set, in some cases
thousands, which complicate automatic pronunciation rules.
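
To make the idea of likelihood-maximizing segmentation concrete, here is a small sketch. It is not the segmenter described in the talk: it assumes a fixed lexicon with unigram log-probabilities and uses dynamic programming to pick the split with the highest total log-probability. The lexicon, its probabilities, and the name `segment` are hypothetical.

```python
import math


def segment(text, word_logprob, max_word_len=4):
    """Return the split of `text` into lexicon words with the highest log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_logprob:
                score = best[start] + word_logprob[word]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None  # the lexicon cannot cover the whole string
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical unigram probabilities for a tiny lexicon.
LEXICON = {word: math.log(p) for word, p in {
    "北京": 0.03, "大学": 0.02, "大学生": 0.01, "北京大学": 0.005, "生": 0.002,
}.items()}

best_split = segment("北京大学生", LEXICON)
```

With these toy probabilities the split 北京 / 大学生 wins over 北京大学 / 生 because its summed log-probability is higher; a different lexicon could reverse that choice.
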

Additional Challenges. There are many details which are critical to get right
for a good user experience but which we cannot discuss here because of space
constraints. These include getting the user interface right, optimizing protocols for
minimum latency, dealing with special cases like numbers, dates and abbreviations
correctly, avoiding showing offensive queries, and improving the system
efficiently after launch using the data coming in.

4  Outlook 

For mobile devices speech is an attractive input modality and besides Voice
Search we have been working on other features, including more general Voice
Input [6], contact dialing (as launched in the US) and recognition of special
phrases to trigger certain applications on the phone. We believe that in the
next few years speech input will become more accurate, more accepted and
useful enough to help users efficiently access and navigate through information
provided through mobile devices.
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Abstract.  Symmetry  can  be  used  to  help  solve  many  problems.  For  instance, 
Einstein's  famous  1905  paper  ("On  the  Electrodynamics  of  Moving  Bodies") 
uses  symmetry  to  help  derive  the  laws  of  special  relativity.  In  artificial  intelli 
gence,  symmetry  has  played  an  important  role  in  both  problem  representation 
and  reasoning.  I  describe  recent  work  on  using  symmetry  to  help  solve  constraint 
satisfaction  problems.  Symmetries  occur  within  individual  solutions  of  problems 
as  well  as  between  different  solutions  of  the  same  problem.  Symmetry  can  also 
be  applied  to  the  constraints  in  a  problem  to  give  new  symmetric  constraints. 
Reasoning  about  symmetry  can  speed  up  problem  solving,  and  has  led  to  the 
discovery  of  new  results  in  hoth  graph  and  numher  theory. 


1  Introduction 

Symmetry occurs in many combinatorial search problems. For example, in the magic
squares problem (prob019 in CSPLib [1]), we have the symmetries that rotate and
reflect the square. Eliminating such symmetry from the search space is often critical
when trying to solve large instances of a problem. Symmetry can occur both within a
single solution as well as between different solutions of a problem. We can also apply
symmetry to the constraints in a problem. We focus here on constraint satisfaction
problems, though there has been interesting work on symmetry in other types of
problems (e.g. planning, and model checking). We summarize recent work appearing
in [2,3,4].


2  Symmetry  between  Solutions 

A symmetry $\sigma$ is a bijection on assignments. Given a set of assignments $A$ and a
symmetry $\sigma$, we write $\sigma(A)$ for $\{\sigma(a) \mid a \in A\}$. A special type of symmetry, called
solution symmetry, is a symmetry between the solutions of a problem. More formally,
we say that a problem has the solution symmetry $\sigma$ iff $\sigma$ of any solution is itself a
solution [5].

Running example: The magic squares problem is to label an n by n square so that the
sum of every row, column and diagonal is equal (prob019 in CSPLib [1]). A normal
magic square contains the integers 1 to $n^2$. We model this with $n^2$ variables $X_{i,j}$ where
$X_{i,j} = k$ iff the ith column and jth row is labelled with the integer k.

*  Supported  by  the  Australian  Government's  Department  of  Broadband,  Communications  and 
the  Digital  Economy  and  the  ARC.  Thanks  to  the  co-authors  of  the  work  summarized  here: 
Marijn Heule, George Katsirelos and Nina Narodytska.

“Lo Shu”, the smallest non-trivial normal magic square, has been known for over
four thousand years and is an important object in ancient Chinese mathematics:

The magic squares problem has a number of solution symmetries. For example, consider
the symmetry $\sigma_d$ that reflects a solution in the leading diagonal. This maps “Lo Shu”
onto a symmetric solution:

Any other rotation or reflection of the square maps one solution onto another. The 8
symmetries of the square are thus all solution symmetries of this problem. In fact, there
are only 8 different magic squares of order 3, and all are in the same symmetry class.
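
The claim that order 3 has exactly 8 magic squares, all in one symmetry class, can be checked by brute force. The sketch below is only an illustration: the helper names are made up here, and the orientation chosen for `LO_SHU` (4 9 2 / 3 5 7 / 8 1 6) is the traditional layout, assumed rather than taken from the figure.

```python
from itertools import permutations


def is_magic(sq):
    """True if every row, column and both diagonals have the same sum."""
    n = len(sq)
    target = sum(sq[0])
    rows = all(sum(row) == target for row in sq)
    cols = all(sum(sq[i][j] for i in range(n)) == target for j in range(n))
    diag = sum(sq[i][i] for i in range(n)) == target
    anti = sum(sq[i][n - 1 - i] for i in range(n)) == target
    return rows and cols and diag and anti


def rotate(sq):
    """Rotate the square by 90 degrees clockwise."""
    return tuple(zip(*sq[::-1]))


def reflect(sq):
    """Mirror the square left to right."""
    return tuple(tuple(row[::-1]) for row in sq)


def symmetry_class(sq):
    """Images of sq under the 8 rotations and reflections of the square."""
    images = set()
    current = tuple(tuple(row) for row in sq)
    for _ in range(4):
        images.add(current)
        images.add(reflect(current))
        current = rotate(current)
    return images


LO_SHU = ((4, 9, 2), (3, 5, 7), (8, 1, 6))  # assumed orientation

# Enumerate all normal magic squares of order 3 (9! candidate labellings).
all_magic = {
    (p[0:3], p[3:6], p[6:9])
    for p in permutations(range(1, 10))
    if is_magic((p[0:3], p[3:6], p[6:9]))
}
```

Comparing `all_magic` with `symmetry_class(LO_SHU)` confirms that every solution is a rotation or reflection of a single square.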

One way to factor solution symmetry out of the search space is to post symmetry
breaking constraints. See, for instance, [6,7,8,9,10,11,12,13,14]. For example, we can
eliminate $\sigma_d$ by posting a constraint which ensures that the top left corner is smaller
than its symmetry, the bottom right corner. This selects (1) and eliminates (2). Symmetry
can be used to transform such symmetry breaking constraints [2]. For example, if we
apply $\sigma_d$ to the constraint which ensures that the top left corner is smaller than the
bottom right, we get a new symmetry breaking constraint which ensures that the bottom
right is smaller than the top left. This selects (2) and eliminates (1).


3  Symmetry  within  a  Solution 


Symmetries can also be found within individual solutions of a constraint satisfaction
problem. We say that a solution $A$ contains the internal symmetry $\sigma$ (or equivalently $\sigma$
is an internal symmetry within this solution) iff $\sigma(A) = A$.

Running example: Consider again “Lo Shu”. This contains an internal symmetry.
To see this, consider the solution symmetry $\sigma_{inv}$ that inverts labels, mapping k onto
$n^2 + 1 - k$. This solution symmetry maps “Lo Shu” onto a different (but symmetric)
solution. However, if we now apply the solution symmetry $\sigma_{180}$ that rotates the square
180°, we map back onto the original solution:


Consider the composition of these two symmetries: $\sigma_{inv} \circ \sigma_{180}$. As this maps “Lo Shu”
onto itself, the solution “Lo Shu” contains the internal symmetry $\sigma_{inv} \circ \sigma_{180}$.
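
The internal symmetry can be verified directly. In this small sketch the function names are illustrative and the orientation of `LO_SHU` is the traditional one, which is an assumption; the check itself does not depend on the orientation.

```python
def invert(sq):
    """sigma_inv: map each label k onto n^2 + 1 - k."""
    n = len(sq)
    return tuple(tuple(n * n + 1 - k for k in row) for row in sq)


def rotate180(sq):
    """sigma_180: rotate the square by 180 degrees."""
    return tuple(tuple(row[::-1]) for row in sq[::-1])


LO_SHU = ((4, 9, 2), (3, 5, 7), (8, 1, 6))  # assumed orientation

inverted = invert(LO_SHU)                # a different, symmetric solution
composed = invert(rotate180(LO_SHU))     # sigma_inv applied after sigma_180
```

Here `inverted` differs from `LO_SHU`, whereas `composed` equals it, so the composition (and not either symmetry on its own) is internal to this solution.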

In general, there is no relationship between the solution symmetries of a problem and the
internal symmetries within a solution of that problem. There are solution symmetries of a
problem which are not internal symmetries within any solution of that problem, and vice
versa. However, when all solutions of a problem contain the same internal symmetry, we
can be sure that this is a solution symmetry of the problem itself. The exploitation of
internal symmetries involves two steps: finding internal symmetries, and then restricting
search to solutions containing just these internal symmetries. We have explored this idea
in two applications where we have been able to extend the state of the art. In the first, we
found new lower bound certificates for Van der Waerden numbers. Such numbers are an
important concept in Ramsey theory. In the second application, we increased the size of
graceful labellings known for a family of graphs. Graceful labelling has practical
applications in areas like communication theory. Before our work, the largest double wheel
graph that we found gracefully labelled in the literature had size 10. Using our method,
we constructed the first known labelling for a double wheel of size 24.


Abstract. Ordinal conditional function (OCF) frameworks have been
successfully used for modeling belief revision when agents' beliefs are
represented in the propositional logic framework. This paper addresses
the problem of belief change of graphical representations of uncertain
information, called OCF-based networks. In particular, it addresses how to
revise OCF-based networks in presence of sequences of observations and
interventions. This paper contains three contributions: Firstly, we show
that the well-known mutilation and augmentation methods for handling
interventions proposed in the framework of probabilistic causal graphs
have natural counterparts in OCF causal networks. Secondly, we provide
an OCF-based counterpart of an efficient method for handling sequences
of interventions and observations by directly performing equivalent
transformations on the initial OCF graph. Finally, we highlight the use of
OCF-based causal networks on the alert correlation problem in intrusion
detection.

Keywords:  OCF-based  networks,  belief  change,  causal  reasoning,  alert 
correlation. 


1  Introduction 

Among the powerful frameworks for representing uncertain pieces of information,
ordinal conditional functions (OCF) [12] is an ordinal setting that has
been successfully used for modeling revision of agents' beliefs [4]. OCFs are very
useful for representing uncertainty and several works point out their relevance
for representing agents’ beliefs and defining belief change operators for updating
the current beliefs in the light of new information [9]. OCF-based networks (also
called kappa-networks) [7] are graphical models [8] expressing the beliefs using
OCF ranking functions. The graphical component allows an easy and compact
representation of influence relationships existing between the domain variables
while OCFs allow an easy quantification of belief strengths. OCF-based networks
are less demanding than probabilistic networks (where exact probability degrees
are needed). In OCF-based networks, belief strengths, called degrees of surprise,
may be regarded as order of magnitude probability estimates, which makes the
elicitation of agents’ beliefs easier.

Causality is an important notion in many applications such as diagnosis,
explanation, simulation, etc. There are several recent approaches and frameworks
addressing causality issues in several areas of artificial intelligence. Among these
formalisms, causal graphical models (such as causal Bayesian graphs [8] and
possibilistic networks [2]) offer efficient tools for causal ascription. While OCF
frameworks have been extensively used for studying default reasoning and belief
revision, there are only few works addressing belief change in OCF-based
networks, while causality issues have not yet been investigated.

Observations are often handled using a simple form of conditioning and the
order in which they are reported does not matter. The situation is clearly
different in the presence of both interventions and observations. Let us consider
an example in the intrusion detection field. Assume that for the network
administrator, the most common situation is that the Web server works normally
and in case where this latter works abnormally or crashes, it is mostly due to
flooding denial of service (DoS) attacks¹ launched by attackers. Now, if one day
the administrator observes that his server works abnormally, then after this
observation, any other external action causing his Web server to crash will not change
his beliefs regarding the fact that a DoS attack is being undertaken. Consider
now the converse situation where just before looking at the alert log file (in
order to check whether DoS attacks were detected), we perform a manipulation
that crashes the Web server². Then after this intervention, without surprise the
administrator observes that his server crashes but he will not change his a priori
beliefs concerning the fact that there is no attack currently under way.
Here, an observation followed by an intervention does not give the same result
as an intervention followed by an observation. This paper contains three main
contributions:

- Firstly, we show that the well-known mutilation and augmentation methods
  [11] for handling interventions proposed in the framework of probabilistic
  causal graphs have natural counterparts in OCF-based networks.

- Secondly, we propose an OCF-based counterpart of an efficient method [3] for
  handling sequences of interventions and observations by directly performing
  equivalent transformations on the causal graph.

- Finally, we highlight the interest of reasoning with sequences of observations
  and interventions on alert correlation, a major problem in computer security.

Let us first provide basic background on OCF networks.

2  A  Brief  Refresher  on  OCF-Based  Networks 

Ordinal conditional functions (OCFs) [12] is an ordinal framework for representing
and changing agents' beliefs. In the following, $V=\{A_1, A_2, \ldots, A_n\}$ denotes
the set of variables. $D_{A_i}$ denotes the domain of a variable $A_i$ and $a_i$ a possible
instance of $A_i$. $\Omega=\times_{A_i \in V} D_{A_i}$ denotes the universe of discourse. An interpretation
$w=(a_1, a_2, \ldots, a_n)$ is an instance of $\Omega$ while $w[A_i]$ denotes the value of variable
$A_i$ in $w$. $\phi$, $\psi$ denote subsets of $\Omega$, called events.


¹ Attacks which overwhelm servers with a huge number of requests.

² For instance, using a bad configuration of an application on the server machine.

An OCF (also called a ranking or kappa function) denoted $\kappa$ is a mapping
from the universe of discourse $\Omega$ to the set of ordinals (here, we assume to a
set of integers) [C]. $\kappa(w_i)$ is called a disbelief degree (or degree of surprise). By
convention $\kappa(w_i)=0$ means that $w_i$ is not surprising and corresponds to a normal
state of affairs while $\kappa(w_i)=\infty$ denotes an implausible event. The relation
$\kappa(w_i)<\kappa(w_j)$ means that $w_i$ is more plausible than $w_j$. The function $\kappa$ is
normalized if there exists at least one possible interpretation $w \in \Omega$ such that $\kappa(w)=0$.
The disbelief degree $\kappa(\phi)$ of an arbitrary event $\phi \subseteq \Omega$ is defined as follows:

$$\kappa(\phi) = \min_{w_i \in \phi} \kappa(w_i). \tag{1}$$


Conditioning is a fundamental notion for updating a priori beliefs when new
evidence (a completely sure event) arrives. It is defined as follows (we assume
that $\kappa(\phi) \neq \infty$):

$$\kappa(w_i \mid \phi) = \begin{cases} \kappa(w_i) - \kappa(\phi) & \text{if } w_i \in \phi, \\ \infty & \text{otherwise.} \end{cases} \tag{2}$$

The effect of conditioning is to exclude every interpretation $w_i$ which does
not satisfy the evidence $\phi$ while the other interpretations are decreased
by $\kappa(\phi)$. In particular, the most plausible interpretation satisfying $\phi$ (namely,
$w_j = \arg\min_{w_i \in \phi} \kappa(w_i)$) is assigned 0.
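
Equations (1) and (2) translate almost literally into code. The following is a minimal sketch in which a ranking function is a dictionary from interpretations to integers; the two-variable universe and its ranks are hypothetical values chosen only to echo the Web server example of the introduction.

```python
import math

INF = math.inf


def rank(kappa, event):
    """Equation (1): the disbelief degree of an event is the minimum rank
    over the interpretations it contains (infinity for the empty event)."""
    return min((kappa[w] for w in event), default=INF)


def condition(kappa, event):
    """Equation (2): shift the interpretations of the event down by its rank
    and make every other interpretation impossible."""
    k_phi = rank(kappa, event)
    if k_phi == INF:
        raise ValueError("cannot condition on an impossible event")
    return {w: (k - k_phi if w in event else INF) for w, k in kappa.items()}


# Hypothetical ranks over (server state, attack status).
kappa = {
    ("normal", "no_attack"): 0,
    ("normal", "attack"): 3,
    ("abnormal", "no_attack"): 5,
    ("abnormal", "attack"): 2,
}
abnormal = {w for w in kappa if w[0] == "abnormal"}
posterior = condition(kappa, abnormal)
```

With these numbers the event `abnormal` has rank 2, and after conditioning the interpretation with an attack gets rank 0, so the result is normalized again.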


2.1  Causal OCF-Based Networks

Graphical models such as probabilistic networks [8] are well-known and efficient
modeling and reasoning tools. Like Bayesian networks, OCF-based ones consist
of two components: i) A graphical component consisting of a directed acyclic
graph (DAG) where the nodes denote the domain variables and arcs encode
direct influence relations existing between these variables, and ii) A numerical
component composed of a set of conditional ranking functions weighting the
influence endured by each variable $A_i$ in the context of its parents $U_{A_i}$.

The normalization condition requires that every local ranking function should
satisfy the following condition:

$$\min_{a_i \in D_{A_i}} \kappa(a_i \mid u_{A_i}) = 0. \tag{3}$$

The joint ranking function encoded by a network G is computed as follows:

$$\kappa(A_1, \ldots, A_n) = \sum_{i=1}^{n} \kappa(A_i \mid U_{A_i}). \tag{4}$$

In a causal OCF-based network, the graph only encodes causal (cause-effect)
relationships. Hence, in a causal OCF-network, the parent set of a node $A_i$
represents all the direct causes of $A_i$. The following example will be used in the
rest of this paper to illustrate our contributions:

This example is about mechanics where we are only interested in the car startup
problem. We define the following variables:

- S (for Start) taking its values in the domain $D_S=\{Yes, No\}$.

- B (for Battery) taking its values in $D_B=\{Charged, Discharged\}$.

- F (for Fuel) taking its values in $D_F=\{Empty, Not Empty\}$ where the value
  Empty denotes an empty fuel tank while Not Empty denotes a non empty
  fuel tank.

- H (for Headlights) taking its values in the domain $D_H=\{On, Off\}$ where
  the value On denotes that the headlights were left switched on overnight and
  Off denotes the fact that the headlights were left switched off overnight.

The OCF-based network representing the car startup problem is given in
Figure 1. For instance, for the fuel variable F, the most common state is that
the fuel tank is not empty while the state Empty is exceptional. Similarly, Off
is the most common state for the headlights variable H. Regarding the variable
B, if the headlights were left switched on overnight, then the value Discharged
is the most common state for variable B. Lastly, if the battery is discharged or
the fuel tank is empty, then the most plausible state for the start variable S is
No (the car does not start).


Local ranking functions of the network (arcs H → B, B → S and F → S):

B            H      κ(B|H)
Charged      Off    0
Discharged   Off    8
Charged      On     4
Discharged   On     0

S     B            F           κ(S|B,F)
Yes   Charged      Not Empty   0
No    Charged      Not Empty   6
Yes   Charged      Empty       15
No    Charged      Empty       0
Yes   Discharged   Not Empty   12
No    Discharged   Not Empty   0
Yes   Discharged   Empty       50
No    Discharged   Empty       0

Fig.  1.  The  OCF-based  network  of  the  car  startup  problem 
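
To make the chain rule of Equation 4 concrete, the sketch below computes the joint ranking function of the car startup network. The ranks for B and S follow the values as they appear to read in Figure 1; the priors `K_H` and `K_F` are assumptions made for illustration (only their ordering, "Off" and "Not Empty" being the normal states, comes from the text).

```python
from itertools import product

# Assumed priors: the figure's values for H and F are not used here.
K_H = {"Off": 0, "On": 1}
K_F = {"NotEmpty": 0, "Empty": 1}

# kappa(B | H), keyed by (b, h).
K_B_GIVEN_H = {
    ("Charged", "Off"): 0, ("Discharged", "Off"): 8,
    ("Charged", "On"): 4, ("Discharged", "On"): 0,
}

# kappa(S | B, F), keyed by (s, b, f).
K_S_GIVEN_BF = {
    ("Yes", "Charged", "NotEmpty"): 0, ("No", "Charged", "NotEmpty"): 6,
    ("Yes", "Charged", "Empty"): 15, ("No", "Charged", "Empty"): 0,
    ("Yes", "Discharged", "NotEmpty"): 12, ("No", "Discharged", "NotEmpty"): 0,
    ("Yes", "Discharged", "Empty"): 50, ("No", "Discharged", "Empty"): 0,
}


def joint_rank(h, b, f, s):
    """Chain rule: the joint rank is the sum of the local ranks."""
    return K_H[h] + K_B_GIVEN_H[(b, h)] + K_F[f] + K_S_GIVEN_BF[(s, b, f)]


joint = {
    (h, b, f, s): joint_rank(h, b, f, s)
    for h, b, f, s in product(K_H, ("Charged", "Discharged"), K_F, ("Yes", "No"))
}
most_plausible = min(joint, key=joint.get)
```

Under these assumptions the only interpretation with rank 0 is the normal state of affairs: headlights off, battery charged, tank not empty and the car starts.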


3  Handling  Interventions  in  OCF  Causal  Networks 

Interventions [11] constitute a fundamental notion for causality ascription as they
provide a natural way for understanding causation. Indeed, causal relationships
are more easily identified if one can directly intervene on the system (as an
experimenter) and evaluate the effects of such manipulations. An intervention is
the action of forcing a variable to a specific value. It is important to note that
an intervention is due to something outside the considered system and it does
not matter how the intervention happens. In the example of Figure 1, we can for
instance remove the spark plugs in order to prevent the car engine from starting
even if the battery is charged and the fuel tank is not empty. In causal networks,
an intervention on a variable $A_i$ must not change our beliefs (expressed in some
uncertainty framework) on the parents $U_{A_i}$ of $A_i$. There are mainly two equivalent
methods for handling interventions in causal graphical models: graph mutilation
proposed by Pearl and Verma in [13] and graph augmentation proposed in [10] by
Pearl. In [2], the authors proposed possibilistic counterparts for the mutilation
and augmentation methods. In the following, we propose the counterparts of
these methods for OCF-based networks.

3.1  Handling  Interventions  by  Mutilating  the  OCF  Causal  Network 

Let G be an initial OCF-based network. An intervention on a variable $A_i$,
denoted $do(a_i)$, ensures that our beliefs on $U_{A_i}$ (the set of parents of $A_i$) are not
affected. In the mutilation method, this is achieved by removing all the arcs
from each variable composing $U_{A_i}$ to $A_i$ while maintaining the rest of the graph
unmodified. The obtained graph is called the mutilated graph and denoted $G_m$
such that $\kappa_G(w \mid do(a_i)) = \kappa_{G_m}(w \mid a_i)$, where $\kappa_G$ (resp. $\kappa_{G_m}$) is the joint ranking
function encoded by G (resp. $G_m$). In order to determine the effect of the
intervention $do(a_i)$ on the rest of the initial graph G, one can apply conditioning on
the mutilated graph $G_m$ after having observed the event $A_i=a_i$. This result is
formalized in the following proposition:

Proposition 1. Let G be an OCF-based causal network and $\kappa_G$ the joint
ranking function encoded by G. Let $G_m$ be the mutilated graph obtained after
handling an intervention $do(a_i)$ and $\kappa_{G_m}$ the joint ranking function encoded
by $G_m$. Let also $\kappa_{G_{a_i}}$ denote the joint ranking function obtained by
conditioning $\kappa_G$ with $do(a_i)$. Then $\forall w \in \Omega$, $\kappa_G(w \mid do(a_i)) = \kappa_{G_{a_i}}(w) = \kappa_{G_m}(w \mid a_i)$.
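
The difference between observing $A_i=a_i$ and intervening with $do(a_i)$ can be sketched on a two-node network F → S. Everything in the block is an illustrative assumption: the dictionary-based network encoding, the ranks, and the choice to give the cut node a local function that is 0 on the forced value and infinite elsewhere.

```python
import math
from itertools import product

INF = math.inf

DOMAINS = {"F": ("NotEmpty", "Empty"), "S": ("Yes", "No")}
NAMES = list(DOMAINS)

# Each node maps to (parents, table); tables are keyed by (value, parent values).
NET = {
    "F": ((), {("NotEmpty", ()): 0, ("Empty", ()): 1}),
    "S": (("F",), {
        ("Yes", ("NotEmpty",)): 0, ("No", ("NotEmpty",)): 6,
        ("Yes", ("Empty",)): 15, ("No", ("Empty",)): 0,
    }),
}


def joint(net):
    """Joint ranking function: sum of the local ranks (chain rule)."""
    out = {}
    for values in product(*(DOMAINS[v] for v in NAMES)):
        w = dict(zip(NAMES, values))
        out[values] = sum(
            table[(w[var], tuple(w[p] for p in parents))]
            for var, (parents, table) in net.items()
        )
    return out


def condition(kappa, var, value):
    """Conditioning on the observation var = value."""
    i = NAMES.index(var)
    k_phi = min(k for w, k in kappa.items() if w[i] == value)
    return {w: (k - k_phi if w[i] == value else INF) for w, k in kappa.items()}


def mutilate(net, var, value):
    """Remove the arcs into var; only the forced value stays unsurprising."""
    forced = {(v, ()): (0 if v == value else INF) for v in DOMAINS[var]}
    return {**net, var: ((), forced)}


observed = condition(joint(NET), "S", "No")
intervened = condition(joint(mutilate(NET, "S", "No")), "S", "No")
```

Observing that the car does not start makes an empty tank the most plausible explanation, while forcing the car not to start leaves the prior beliefs about the tank untouched.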


3.2  Handling  Interventions  by  Augmenting  the  OCF  Causal 
Network 

The principle of the augmentation method [10] is to consider an intervention as
an extra node in the system. Then a parent node denoted $DO_i$ is added to the
node $A_i$ under intervention. Hence, the parents set of the variable $A_i$ (i.e. $U_{A_i}$)
is augmented by the extra node $DO_i$ allowing to specify the behavior of the
variable $A_i$. The domain of $DO_i$ is $\{do_{a_i} : a_i \in D_{A_i}\} \cup \{do_{i\text{-}noact}\}$ where the value
$do_{i\text{-}noact}$ means that no intervention is performed on $A_i$ while $do_{a_i}$ means that
the variable $A_i$ is forced to take the value $a_i$. The obtained augmented network
is denoted $G_a$ such that $\kappa_G(w \mid do(a_i)) = \kappa_{G_a}(w \mid DO_i = do_{a_i})$, where $\kappa_G$ (resp. $\kappa_{G_a}$)
is the joint ranking function encoded by G (resp. $G_a$).

Proposition 2. Let G be an OCF-based causal network and $\kappa_G$ the joint
ranking function encoded by G. Let $G_a$ be the augmented graph for handling an
intervention $do(a_i)$ by adding the node $DO_i$. Let $U'_{A_i} = U_{A_i} \cup DO_i$ and $u'_{A_i}$
be an instance of $D_{U'_{A_i}}$. $G_a$ is such that every variable $A_j$ different from $A_i$
has the same local ranking function as in G and

$$\kappa_{G_a}(a_i \mid u'_{A_i}) = \begin{cases} 0 & \text{if } DO_i = do_{a_i}, \\ \kappa_G(a_i \mid u_{A_i}) & \text{if } DO_i = do_{i\text{-}noact}, \\ \infty & \text{otherwise.} \end{cases} \tag{5}$$

Then $\kappa_G(w \mid do(a_i)) = \kappa_{G_a}(w \mid DO_i = do_{a_i})$.
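
A minimal sketch of the augmentation method on a two-node network F → S follows. The ranks are hypothetical, and so is the decision to give every value of the extra node rank 0; since the result is conditioned on that node, any constant prior on it would lead to the same ranks.

```python
INF = float("inf")

F_VALUES = ("NotEmpty", "Empty")
S_VALUES = ("Yes", "No")
NOACT = "do_noact"
DO_VALUES = tuple(f"do_{s}" for s in S_VALUES) + (NOACT,)

# Hypothetical local ranking functions of the initial network.
K_F = {"NotEmpty": 0, "Empty": 1}
K_S_GIVEN_F = {
    ("Yes", "NotEmpty"): 0, ("No", "NotEmpty"): 6,
    ("Yes", "Empty"): 15, ("No", "Empty"): 0,
}


def augmented_local(s, f, do):
    """Local function of S once the extra parent DO_S is added."""
    if do == NOACT:
        return K_S_GIVEN_F[(s, f)]         # no intervention: original ranks
    return 0 if do == f"do_{s}" else INF   # S is forced to the do-value


def conditioned_on_do(do):
    """Ranks over (F, S) given DO_S = do; DO_S has rank 0 for every value
    (an assumption), so it contributes nothing to the sum."""
    table = {(f, s): K_F[f] + augmented_local(s, f, do)
             for f in F_VALUES for s in S_VALUES}
    lowest = min(table.values())
    return {w: k - lowest for w, k in table.items()}


forced_no = conditioned_on_do("do_No")
untouched = conditioned_on_do(NOACT)
```

Conditioning on `do_No` reproduces the ranks that cutting the arc F → S would give, and conditioning on the no-action value returns the original joint ranking function.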


4  Handling Sequences of Interventions/Observations

Contrary to the handling of a sequence involving only observations or only
interventions, handling sequences involving both observations and interventions
should be done differently depending on the order in which observations and
interventions occur. More particularly, given an OCF-based network encoding the
initial beliefs, there might exist situations where the revised beliefs after having
an observation followed by an intervention will not be the same as if we have
first the intervention preceding the observation. In order to illustrate this issue,
consider the following two scenarios on the example of Figure 1:

Example  (Continued) 

1. Scenario 1 (An observation preceding an intervention): Assume that
   one morning, the car does not start. We change our a priori beliefs (the
   battery is working (charged), the fuel tank is not empty and the car headlights
   were not left switched on overnight). According to the beliefs encoded by
   the network of Figure 1, we deduce that either the battery is discharged or
   the fuel tank is empty. After this observation, assume an intervention
   preventing the car from starting (for example, removing a spark plug). Clearly,
   after this intervention, we will not change our beliefs regarding the battery
   and the fuel tank.

2. Scenario 2 (An intervention preceding an observation): Assume in
   this scenario that before trying to start the car, we first remove a spark
   plug. Unsurprisingly, the car does not start. Knowing that a spark plug
   was removed, it is clear that the fact that the car does not start is due
   to the intervention. Consequently, the most plausible state (according to
   Figure 1) is that the battery is Charged and the fuel tank is Not Empty and
   the headlights were left switched Off overnight, namely the initial beliefs
   before any intervention or observation.
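
The two scenarios can be replayed on a reduced network F → S. This is a sketch under assumptions: the ranks are hypothetical, the battery and headlights are left out, and `observe` folds the observation into the parent and renormalizes it, which is one simple way to keep the result in network form.

```python
INF = float("inf")

F_VALUES = ("NotEmpty", "Empty")
S_VALUES = ("Yes", "No")

# Hypothetical initial beliefs.
K_F = {"NotEmpty": 0, "Empty": 1}
K_S_GIVEN_F = {
    ("Yes", "NotEmpty"): 0, ("No", "NotEmpty"): 6,
    ("Yes", "Empty"): 15, ("No", "Empty"): 0,
}


def fixed_child(s_value):
    """Local function of S once its value is settled: it ignores F."""
    return {(s, f): (0 if s == s_value else INF)
            for s in S_VALUES for f in F_VALUES}


def intervene(k_f, k_s_given_f, s_value):
    """do(S = s_value): cut the arc F -> S, beliefs on F are untouched."""
    return dict(k_f), fixed_child(s_value)


def observe(k_f, k_s_given_f, s_value):
    """Observe S = s_value: add kappa(s_value | f) to F, then renormalize F."""
    combined = {f: k_f[f] + k_s_given_f[(s_value, f)] for f in F_VALUES}
    beta = min(combined.values())
    return {f: k - beta for f, k in combined.items()}, fixed_child(s_value)


# Scenario 1: the car is observed not to start, then a spark plug is removed.
scenario_1, _ = intervene(*observe(K_F, K_S_GIVEN_F, "No"), "No")

# Scenario 2: a spark plug is removed, then the car is observed not to start.
scenario_2, _ = observe(*intervene(K_F, K_S_GIVEN_F, "No"), "No")
```

In the first ordering an empty tank ends up as the most plausible state, while in the second the initial beliefs about the tank survive, so the two sequences are not interchangeable.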

Clearly, these scenarios show that the order of occurrence of observations and
interventions should be taken into account. However, existing approaches [11] [2]
confuse the notions of observations and interventions and do not explicitly
distinguish between the two scenarios. The following section presents the OCF-based
counterpart of an efficient method for handling sequences of observations and
interventions proposed in [3] (resp. in [1]) in the context of min-based (resp.
product-based) causal possibilistic networks.

4.1  Graphical  Handling  of  Sequences  of  Both  Interventions  and 
Observations  in  Causal  OCF-Based  Networks 

Our method views each observation $A_i=a_i$ or intervention $do(a_i)$ as a belief
change process that transforms an initial ranking function $\kappa$ (associated with
some OCF-based network) into a new ranking function $\kappa(\cdot \mid A_i=a_i)$ or $\kappa(\cdot \mid do(a_i))$.
Hence, it is enough to build an OCF-based network associated with $\kappa(\cdot \mid A_i=a_i)$
and $\kappa(\cdot \mid do(a_i))$. While the handling of interventions is straightforward in causal
networks by mutilating the graph (or equivalently by augmenting the graph),
handling observations graphically needs more operations. In the following we
propose a graphical counterpart for the conditioning operation for handling
observations in causal OCF-based networks. We restrict ourselves to OCF-based
networks where DAGs are trees, where a node can have at most one parent.

4.2  Graphical  Counterpart  of  Conditioning  for  Handling 
Observations 

In order to perform conditioning directly on the graph, conditioning is viewed as
a sequence of two operations: i) A combination operation (which combines the
original ranking function with the one associated with the observation $A_i=a_i$),
and ii) A normalization operation (for normalizing the ranking function obtained
after the combination step in case where this latter becomes sub-normalized).
To make this decomposition clear, let G be an OCF-based network and $\kappa_G$ be
the ranking function encoded by G ($\kappa_G$ is obtained from G using the chain rule
of Equation 4). In order to perform the combination operation, let us define the
local ranking function associated with the observation as follows:

$$\kappa_{A_i=a_i}(w) = \begin{cases} 0 & \text{if } w[A_i] = a_i, \\ \infty & \text{otherwise.} \end{cases} \tag{6}$$


Combining the initial ranking function $\kappa_G$ with $\kappa_{A_i=a_i}$ can be defined as follows:

$$\forall w \in \Omega,\quad \kappa_{G_2}(w) = \kappa_G(w) + \kappa_{A_i=a_i}(w). \tag{7}$$

The ranking function $\kappa_{G_2}$ is obtained from $\kappa_G$ by considering as completely
impossible every interpretation $w$ where the value of $A_i$ is different from $a_i$
(namely, $\forall w \in \Omega$, $\kappa_{G_2}(w)=\infty$ if $w[A_i] \neq a_i$), and preserving unchanged the disbelief
degrees of all interpretations $w$ where the value of $A_i$ is $a_i$. After this step, $\kappa_{G_2}$
may be sub-normalized. Let us define the normalization operation as follows:

$$\forall w \in \Omega,\quad \kappa_{G_3}(w) = \kappa_{G_2}(w) - \min_{w' \in \Omega} \kappa_{G_2}(w'). \tag{8}$$

Hence, using the combination and normalization formulas (see Equations 7 and
8), the conditioning given by Equation (2) can be redefined as follows:

$$\forall w \in \Omega,\quad \kappa_G(w \mid A_i = a_i) = \kappa_{G_3}(w). \tag{9}$$

Let us now provide the graphical counterparts of combination and normalization
operations.
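
Before moving to the graph, the decomposition of conditioning into a combination step and a normalization step can be checked on a joint ranking function. The sketch below uses a hypothetical joint function over (F, S); the function names are illustrative.

```python
INF = float("inf")

# Hypothetical joint ranking function over interpretations (F, S).
KAPPA_G = {
    ("NotEmpty", "Yes"): 0, ("NotEmpty", "No"): 6,
    ("Empty", "Yes"): 16, ("Empty", "No"): 1,
}


def observation_rank(w, index, value):
    """Ranking function of the observation: 0 if w agrees with it, else infinity."""
    return 0 if w[index] == value else INF


def combine(kappa, index, value):
    """Combination step: add the observation's ranking function."""
    return {w: k + observation_rank(w, index, value) for w, k in kappa.items()}


def normalize(kappa):
    """Normalization step: subtract the smallest rank."""
    lowest = min(kappa.values())
    return {w: k - lowest for w, k in kappa.items()}


def condition(kappa, index, value):
    """Direct conditioning on the event 'variable at index equals value'."""
    k_phi = min(k for w, k in kappa.items() if w[index] == value)
    return {w: (k - k_phi if w[index] == value else INF) for w, k in kappa.items()}


kappa_g2 = combine(KAPPA_G, 1, "No")   # observe S = No
kappa_g3 = normalize(kappa_g2)
```

After the combination step the smallest rank is 1, so the function is sub-normalized; subtracting it gives the same result as conditioning directly.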

Graphical Counterpart of the Combination Operation. Let us use $G_2$
to denote the result of integrating the new observation $A_i=a_i$ in the network G,
namely the network associated with the ranking function given by Equation 7.
$G_2$ is specified as follows:

Proposition 3. The OCF-based network $G_2$ associated with the ranking
function given by Equation 7 is obtained from network G as follows:

- the structure of $G_2$ is obtained from the DAG of G by deleting the arc
  from the parent of $A_i$ to $A_i$.

- the local function of any variable $A_j$ in $G_2$ different from $A_i$ and $U_{A_i}$
  is identical to $A_j$'s local function in G. Regarding $A_i$ and its parent
  denoted D, the new local ranking functions are defined as follows:

  • $\forall a_k \in D_{A_i}$, $\kappa_{G_2}(a_k) = 0$ if $a_k = a_i$, and $\kappa_{G_2}(a_k) = \infty$ otherwise.

  • Let C be the parent of D in G, then $\forall d_l \in D_D$, $\forall c_j \in D_C$,
    $\kappa_{G_2}(d_l \mid c_j) = \kappa_G(d_l \mid c_j) + \kappa_G(a_i \mid d_l)$.

The new local ranking function relative to the variable $A_i$ ensures that only the
instance $a_i$ is fully accepted and all the other instances are completely implausible.
Note that contrary to handling interventions, the ranking function relative
to variable D (parent of $A_i$) is altered in order to ensure that disbelief degrees of
every interpretation $w$ satisfying $a_i$ are identical in $\kappa_G$ and $\kappa_{G_2}$. Hence, since the
value of the variable $A_i$ is now fully determined, there is no need to maintain
the arc from the parent of $A_i$ (here D) to $A_i$. One can easily check that $\forall w \in \Omega$,
$\kappa_{G_2}(w) = \kappa_G(w) + \kappa_{A_i=a_i}(w)$.
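
The graphical combination step can be sketched on a chain C → D → A with the observation made on A. All ranks and value names below are hypothetical; the block checks that the joint function of the transformed network equals the initial joint function plus the ranking function of the observation.

```python
from itertools import product

INF = float("inf")

C_VALS, D_VALS, A_VALS = ("c0", "c1"), ("d0", "d1"), ("a0", "a1")

# Hypothetical local ranking functions of the initial chain C -> D -> A.
K_C = {"c0": 0, "c1": 2}
K_D_C = {("d0", "c0"): 0, ("d1", "c0"): 3, ("d0", "c1"): 1, ("d1", "c1"): 0}
K_A_D = {("a0", "d0"): 0, ("a1", "d0"): 4, ("a0", "d1"): 2, ("a1", "d1"): 0}


def joint_g(c, d, a):
    """Joint rank in the initial network (chain rule)."""
    return K_C[c] + K_D_C[(d, c)] + K_A_D[(a, d)]


def combine_graphically(a_obs):
    """Detach A from D and push kappa(a_obs | d) into D's local function."""
    k2_a = {a: (0 if a == a_obs else INF) for a in A_VALS}
    k2_d_c = {(d, c): K_D_C[(d, c)] + K_A_D[(a_obs, d)]
              for d in D_VALS for c in C_VALS}
    return k2_a, k2_d_c


K2_A, K2_D_C = combine_graphically("a1")


def joint_g2(c, d, a):
    """Joint rank in the transformed network, where A has no parent."""
    return K_C[c] + K2_D_C[(d, c)] + K2_A[a]


agrees = all(
    joint_g2(c, d, a) == joint_g(c, d, a) + (0 if a == "a1" else INF)
    for c, d, a in product(C_VALS, D_VALS, A_VALS)
)
```

With these numbers the new local function of D given `c0` has minimum 3, so it is sub-normalized and needs a separate normalization step.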

Example  (Continued) 

We continue with the example of Figure 1 but restricted to a tree by discarding
nodes B (Battery variable) and H (Headlights variable). Figure 2 gives the initial
network G and $G_2$ obtained after combining G with the observation S=No.

As for node F of network $G_2$ of Figure 2, the new ranking function of the parent
of the observed variable may be sub-normalized. The following step deals with
this problem.

Graphical Counterpart of the Normalization Operation. After the
combination step, the ranking function relative to the parent variable (denoted
here D) of the observed one (here $A_i$) may be sub-normalized. Namely, there
may exist an instance $c_j$ of the parent variable of D, denoted C, such that
$\min_{d_l \in D_D} \kappa_{G_2}(d_l \mid c_j) = \beta$ ($\beta > 0$). We want to compute a new OCF network,
denoted $G_3$, such that it satisfies Equation 8. The network $G_3$ is constructed such
that all of its local ranking functions are normalized. $G_3$ is obtained by
progressively normalizing local ranking functions for each variable. We first study
the case where only the local ranking function on the root variable in $G_2$ is
sub-normalized:

Proposition  4.  Let  G 2  be  the  network  obtained  from  the  combination  step. 
Assume  that  only  the  root  variable,  denoted  by  D.  is  sub-normalized.  Let 
inin<b£DD{KGz{di))—P  and  O<0.  G3  is  such  that: 

—  The  structure  of  G3  is  identical  to  the  one  of  G2, 

-  VX,  X^D,  V.r GDx,  \/uxeDUx*  ^G3{x\ux)=kC2(^M, 

-  V  cIiGDd.  KGa(dl)=KG2(dl)-0. 

Then,  Vc^Gi?,  /6G3(^)=^02(^)"mini(KG2(^t))- 

After this transformation, the local ranking function relative to D is re-normalized
while the joint one encoded by the network G3 satisfies Equation 8.
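
As a rough illustration of Proposition 4, the following Python sketch normalizes
a sub-normalized root and checks that every joint degree drops by the same
amount. The two-variable network (root D with a single child X), its values and
its ranks are hypothetical and are not those of Figure 2; the joint degree is
assumed to be the sum of the local degrees, as is usual in OCF-based networks.

```python
def normalize_root(kappa_root):
    """Proposition 4 (sketch): subtract beta, the minimum rank, from the root."""
    beta = min(kappa_root.values())
    return {d: rank - beta for d, rank in kappa_root.items()}, beta


def joint(kappa_root, kappa_child):
    """Joint degree of each world (d, x), assumed to be the sum of local ranks."""
    return {(d, x): kappa_root[d] + rank
            for (x, d), rank in kappa_child.items()}


# Hypothetical network G2: root D is sub-normalized (its minimum rank is 2).
g2_root = {"d1": 2, "d2": 5}
g2_child = {("x1", "d1"): 0, ("x2", "d1"): 3,   # kappa(x | d1)
            ("x1", "d2"): 1, ("x2", "d2"): 0}   # kappa(x | d2)

g3_root, beta = normalize_root(g2_root)   # the child table is left unchanged
joint_g2 = joint(g2_root, g2_child)
joint_g3 = joint(g3_root, g2_child)
```

In this toy case the smallest joint degree of G2 equals the root's minimum rank,
so subtracting it at the root is enough to bring the smallest joint degree of G3
back to zero.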


Example  (Continued) 

Figure 2 shows that the local ranking function relative to the root node F of
network G2 (obtained after the combination of network G with the observation
S=No) is sub-normalized. The normalization of this ranking function according
to Proposition 4 gives the network G3 of Figure 2. One can easily check that the
joint ranking function encoded by network G3 satisfies Equation 8.


Fig. 2. Initial OCF-based network G and G2 (resp. G3) obtained after the combination
(resp. normalization) step


Let us now deal with the case where the sub-normalized function is relative
to a variable D which is not a root. Let us denote by C the parent of D. In
this case, the ranking function of C must be altered in order to keep the
underlying joint function unchanged. The normalization of a non-root variable D
is performed using Proposition 5 without changing the global ranking function:

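Proposition 5. Let G2 be the network obtained from the combination step. Let
D denote the variable whose ranking function is sub-normalized. Let C be the
parent variable of D and $c_k$ be the value of C such that
$\min_{d_i \in D_D}(\kappa_{G_2}(d_i|c_k)) = \beta$ with $\beta > 0$. Network G3
is such that it has the same structure as G2 and:

- $\forall X$, $X \neq D$ and $X \neq C$, $\kappa_{G_3}(x|u_X) = \kappa_{G_2}(x|u_X)$,

- $\forall d_i \in D_D$, $\forall c_j \in D_C$,

  $\kappa_{G_3}(d_i|c_j) = \begin{cases} \kappa_{G_2}(d_i|c_j) - \beta & \text{if } c_j = c_k \\ \kappa_{G_2}(d_i|c_j) & \text{otherwise} \end{cases}$

- $\forall c_j \in D_C$, $\forall u_{c_j} \in D_{U_C}$,

  $\kappa_{G_3}(c_j|u_{c_j}) = \begin{cases} \kappa_{G_2}(c_j|u_{c_j}) + \beta & \text{if } c_j = c_k \\ \kappa_{G_2}(c_j|u_{c_j}) & \text{otherwise} \end{cases}$

Then, $\forall \omega \in \Omega$, $\kappa_{G_2}(\omega) = \kappa_{G_3}(\omega)$.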

As shown in the example of Figure 3 (see variable B of network G2), if after
the re-normalization of D its parent C becomes in turn sub-normalized, then the
normalization process should be repeated until a root variable is reached. Once
a root is reached, it is enough to re-normalize according to Proposition 4 to get
an OCF-based network where all the local ranking functions are normalized.
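
The two-stage process can be sketched in Python as follows. This is only an
illustration under stated assumptions: the chain C -> D, its values and its
ranks are hypothetical (they are not those of Figure 3), C is taken to be a
root so that its ranking function has no conditioning part, and the joint
degree is assumed to be the sum of the local degrees.

```python
def normalize_non_root(kappa_d, kappa_c, c_k):
    """Proposition 5 (sketch) for one parent value c_k; C is assumed to be a root."""
    beta = min(rank for (d, c), rank in kappa_d.items() if c == c_k)
    new_d = {(d, c): rank - beta if c == c_k else rank
             for (d, c), rank in kappa_d.items()}
    new_c = {c: rank + beta if c == c_k else rank
             for c, rank in kappa_c.items()}
    return new_d, new_c


def normalize_root(kappa_root):
    """Proposition 4 (sketch): shift the root so that its minimum rank is 0."""
    beta = min(kappa_root.values())
    return {c: rank - beta for c, rank in kappa_root.items()}


def joint(kappa_c, kappa_d):
    """Joint degree of each world (c, d), assumed to be the sum of local ranks."""
    return {(c, d): kappa_c[c] + rank for (d, c), rank in kappa_d.items()}


# Hypothetical network G2: kappa(d | c1) is sub-normalized (minimum rank 2).
g2_c = {"c1": 0, "c2": 1}
g2_d = {("d1", "c1"): 2, ("d2", "c1"): 3,
        ("d1", "c2"): 0, ("d2", "c2"): 4}

# Step 1: normalize D for c1; the degree beta is pushed up to its parent C.
g3a_d, g3a_c = normalize_non_root(g2_d, g2_c, "c1")
# Step 2: C has become sub-normalized and is a root, so Proposition 4 applies.
g3b_c = normalize_root(g3a_c)
```

Step 1 leaves the joint ranking function untouched, while step 2 shifts it so
that its smallest degree is zero.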

Example  (Continued) 

Here, the network G is limited to variables S, B and H. Figure 3 shows that
the local ranking function relative to the non-root node B of network G2
(obtained after the combination of network G with the observation S=No) is
sub-normalized. The normalization of this ranking function according to
Proposition 5 gives the network G3-a of Figure 3. Now the normalization of B
renders H sub-normalized. The latter is normalized according to Proposition 4,
giving the network G3-b of Figure 3. One can check that the joint ranking
function encoded by network G3-b satisfies Equation 8 and that G3-b is
completely normalized. We provide in the following an application scenario of
OCF-based causal networks in the area of computer security.


Fig. 3. Initial OCF-based network G and G2 (resp. G3-a and G3-b) obtained after
the combination (resp. normalization) step

5 Application to Predicting/Preventing Dangerous Attacks

Alert correlation [5] plays an important role in today's computer security
infrastructures. It consists in analyzing the alerts triggered by one or multiple
intrusion detection systems and security tools in order to provide a synthetic
and high-level view of the interesting malicious events targeting the information
system. In this application, we are concerned with predicting/preventing severe
attacks, which are often the final step in multi-step attacks. Clearly, there is a
need for i) an easy elicitation method allowing security administrators
to express their domain knowledge (on security threats, vulnerabilities, etc.)
and ii) a method to reason given observations (data directly collected from the
information systems) and interventions (manipulations and actions undertaken
by the administrators, attackers, etc.). OCF-based causal networks offer
several advantages for the severe attack prediction/prevention problem: they
make it easy for the administrators to elicit their knowledge and allow them to
assess the plausibility that an event occurs, that an attacker reaches a given
objective given some observed events, etc. They also allow them to determine
which countermeasures should be taken in order to prevent a given attack.

An OCF-Based Model for Severe Attack Prediction/Prevention. We
are interested in anticipating severe attacks in order to prevent them by taking
the appropriate countermeasures (such as preventing the suspected user from
pursuing his attack). The actions that may be undertaken by attackers and
their possible consequences, the security policy and the countermeasures taken
by security administrators, etc. clearly involve causal relationships that can be
modeled by a causal network, which can be used for instance to evaluate the
plausibility of different scenarios. We propose a model for this problem and we
define the following variable categories:

Observation/intervention variables: They represent relevant variables for
monitoring the information system. For instance, the number of HTTP
requests sent to a server is relevant information for detecting/
preventing denial of service attacks.

Attack objective variables: They represent the final/intermediate objectives
targeted by the attackers. For example, gaining local access, root access,
etc. are among the most recurrent objectives of today's Internet hackers.

In this model, observation/intervention variables are either directly observed
or manipulated (for instance, a network monitor can count the number of
inbound HTTP requests; some variables can however be manipulated by the
administrators' interventions, such as configuring a firewall to stop the requests
coming from a given suspected host) while attack objective variables are
associated with the attacks administrators may want to predict/prevent. While the
network structure easily encodes the causal relationships between the relevant
variables, the a priori and conditional ranking functions make it easy to weight
the uncertainty and the influence of each variable on its children.

5.1 Scenario Evaluation and Countermeasure Determination

After an OCF-based network is built based on the domain knowledge, it can
then be efficiently used for different tasks. In particular, it can be used for:

i)  Scenario  evaluation:  Given  an  OCF-causal  network  representing  the  ad¬ 
ministrators’  knowledge,  one  can  evaluate  the  plausibility  of  any  event  of  interest 
such  as  the  one  that  an  attacker  reaches  a  given  attack  objective  having  observed 
some  security  events  in  the  audit  data. 

ii) Countermeasure determination: The aim of this task is to determine
what action(s) should be taken in order to prevent an attacker from attaining
a given objective. Administrators can intervene on some variables and assess
the plausibility that a given attack objective is attained in order to determine
whether this action is adequate or insufficient for preventing this attack.

In this application, before actually taking countermeasures, there is a clear
need to intervene on the model (instead of directly intervening on the
system) in order to check whether a given intervention (here a countermeasure)
will help to secure the information system or still allow an attacker to attain
his objective. By evaluating different scenarios, the users can determine the
most appropriate countermeasures. Finally, note that it is important to take
into account the order of arrival of observations/interventions. For instance, for
a security administrator, observing a Web server crash then intervening on the
system by stopping the network will not lead to the same conclusions as first
stopping the network then observing the Web server crash. Clearly, our approach
for handling sequences of both observations and interventions is relevant for the
severe attack prediction/prevention problem.

6 Conclusion

This paper addressed important issues regarding belief change in OCF-based
networks and handling sequences of both observations and interventions. It
provided three major contributions: a) We showed that the well-known graph
mutilation and augmentation methods for handling interventions in probabilistic
graphs have natural counterparts in OCF networks. b) We proposed an OCF-based
counterpart of an efficient method for handling observations in causal
graphs by directly performing equivalent transformations on the initial graph.
This method allows new observations to be integrated efficiently and provides a
graphical counterpart for the conditioning operation. c) We provided a real
application scenario in the field of computer security highlighting the importance
of reasoning in the presence of sequences of observations and interventions.

Abstract. Query-focused multi-document summarization aims to create a
compressed summary biased to a given query. This paper presents a context-sensitive
approach based on manifold ranking of sentences to this summarization
task. The proposed context-enhanced manifold ranking approach not only
looks at the sentence itself, but also considers its surrounding contextual
information. Compared to the existing manifold ranking approach, which totally
ignores the contextual information of a sentence, this approach can capture
additional relevant information, which is especially necessary for formulating
the relationships between short text snippets like sentences. Experiments are
conducted on the DUC 2005 and DUC 2006 data sets and the ROUGE evaluation
results demonstrate the advantages of the proposed approach.

Keywords:  Query-focused  multi-document  summarization,  context-sensitive 
manifold  ranking. 


1  Introduction 

With the growing popularity of the Internet and a variety of information services,
obtaining the desired information has become a serious problem in the information
age. As such, new technologies that can process information efficiently are needed.
Automatic document summarization, which is the process of reducing the size of
documents while preserving the important semantic content, is an essential
technology to overcome this obstacle. Most of the summarization work done to date
follows the sentence extraction framework, which ranks sentences in some way and
selects top-ranked sentences from original documents to form summaries. Extractive
summarization generally falls into two categories according to the nature of
summarization: generic summarization, which aims at extracting a summary about the
general ideas of documents, and query-focused summarization, which aims at not only
extracting the important information contained in the documents, but also
guaranteeing that the extracted information is biased to the given query. What we
are interested in in this paper is query-focused summarization.

Query-focused multi-document summarization has drawn much attention in recent
years due to its applicability and merits in real-world applications. Since it is able to
provide concise information corresponding to the specific queries from different
users, it has been applied to services like personalized Web services or document
understanding to support the various interests of users. In contrast to the conventional
task of question answering (QA), which mainly focuses on simple factoid questions and
results in precise answers such as a person, location or date, in the case of
query-focused summarization the queries are mostly real-world complex questions (e.g.,
“Identify and describe types of organized crime that crosses borders or involves more
than one country.”). Such complex questions make summarization tasks more
challenging and, at the same time, more important.

Recently, the manifold ranking algorithm has been exploited for query-focused
multi-document summarization, such as in [1]. The manifold ranking based approaches
first construct a weighted graph representing the query and the sentences as vertices.
Then the positive ranking score of the query is iteratively propagated to nearby
vertices via the structure of the graph. Finally all sentences are ranked according
to their ranking scores, with a larger score indicating higher relevance. Inspired by
the success of manifold ranking, in this paper we propose an enhanced approach to
further integrate the contextual information of sentences into manifold ranking for
query-focused multi-document summarization. The motivation for this approach is the
consensus that short text snippets like sentences often contain insufficient
information to measure the relationships between them and to support ranking them.
In our approach, we use the one preceding and one following sentence of the sentence
currently under consideration as additional contextual information to enrich the
sentence representation or to refine the standard sentence-to-sentence cosine
similarity measure, and develop four strategies to construct the context-sensitive
affinity matrices, which are essential to a manifold ranking algorithm. Compared to
the existing manifold ranking approach, our approach can capture additional relevant
information by using contextual sentences. The experiments conducted on the data sets
from DUC 2005 and DUC 2006 show that the summarization results with contextual
information are better than those without contextual information, achieving
state-of-the-art performance.

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  reviews  related 
work.  Section  3  introduces  the  proposed  manifold  ranking  algorithm  using  contextual 
information  of  sentences.  Section  4  then  presents  experiments  and  evaluations.  Fi¬ 
nally,  Section  5  concludes  the  paper. 


2  Related  Work 

A variety of summarization approaches have been proposed in the literature. These
approaches were either extractive or abstractive. Extractive summarization assigned a
significance score to each sentence and extracted the sentences with the highest scores
to form the summaries. Abstractive summarization, on the other hand, involved a certain
degree of understanding of the content conveyed in the original documents and
created the summaries based on information fusion and/or language generation
techniques [2]. Like most researchers in this field, we follow the extractive
summarization framework in this work.

Under the framework of extractive summarization, sentence ranking is the issue of
most concern. Traditional feature-based approaches evaluated sentence significance
and ranked the sentences depending on features that were designed to
characterize the different aspects of the sentences. The centroid-based approach [3]
was one of the most popular feature-based summarization approaches. Other statistical
and linguistic features such as term frequency, sentence position, sentence dependency
structure and query relevance have also been extensively investigated in the past.
In recent years, graph-based approaches have been proposed to rank sentences. These
approaches modeled a document or a set of documents as a weighted text graph, took
into account the global information and recursively calculated sentence significance
from the entire text graph rather than relying only on the unconnected individual
sentences. LexRank [4] and TextRank [5] were examples of such approaches. Both of
them were motivated by PageRank, which has been successfully used for ranking
Web pages in the Web graph.

Most existing query-focused multi-document summarization approaches incorporated
the information of the given query into generic summarizers in order to
extract the sentences suiting the user's declared information need. In [6], a
query-based feature that computed the similarity between sentence and query was
combined with a set of document-based features. The role of the query words and of
the named entities appearing in the query was especially emphasized in [7]. Later, a
topic-sensitive version of PageRank was proposed to incorporate the relevance of a
sentence to the query into LexRank to get a biased PageRank ranking [8]. As a matter
of fact, for those graph-based approaches, the influence of the query was normally
reflected in the formulation of sentence vertices in a text graph.

Different from the traditional query-focused summarization approaches, which
were usually simple extensions of generic summarizers and did not uniformly fuse
the information in the query and the documents, Wan et al. [1] proposed a manifold
ranking based approach to make uniform use of sentence-to-sentence and
sentence-to-query relationships. A weighted graph was built where the vertices included
both the query description and the sentences in the documents. Manifold ranking was
employed to iteratively propagate the relevance of the query to nearby vertices via the
graph structure. The ranking score of a sentence obtained by this process indicated the
topic-biased informativeness of the sentence, and those with high ranks were chosen to
form the summary.


3  Context-Sensitive  Manifold  Ranking  Approach 

Manifold ranking is a semi-supervised learning algorithm that explores the relationship
among all the data points in the feature space [9, 10]. It has two versions for
different tasks: (1) to rank the data points, or (2) to predict the labels of the
unlabeled data points. For the task of ranking, its prior assumptions are that (1)
nearby points are likely to have the same ranking scores; and (2) points on the same
structure (typically referred to as a cluster or a manifold) are likely to have the
same ranking scores.

3.1  Notation 

In this paper, each sentence, either a document sentence or a query sentence, is
represented by an $m$-dimensional feature vector $x$ and forms a sentence point in
the Euclidean space. Let $X = \{x_0, x_1, \ldots, x_n\} \subset R^m$, where the first
point $x_0$ is the query description and the remaining $n$ points are the sentences
to be ranked according to their relevance to the query. Note that because the topic
description is usually short, in our experiments we treat it as a pseudo-sentence and
it is processed in the same way as the other sentences.

Let $y = [y_0, \ldots, y_n]^T$, where $y_0 = 1$ corresponds to the query sentence
$x_0$ and $y_i = 0$ ($i = 1, \ldots, n$) for all the sentences in the documents. Let
$f: X \to R$ denote a ranking function which assigns to each sentence point $x_i$ a
ranking score $f_i$.

3.2  Basic  Manifold  Ranking  Algorithm 

The basic manifold ranking algorithm is presented in Table 1. An intuitive description
of this algorithm is as follows: a weighted graph is first formed which takes each
sentence as a vertex; a positive ranking score is assigned to the query and zero to the
remaining sentences; all the vertices then spread their scores to the nearby vertices
via the weighted graph; the spreading process is repeated until a global stable state
is reached, and all the vertices except the query then have their own scores according
to which they are ranked. The propagation of ranking scores reflects the relationship
of all vertices, since in the weighted graph, distant vertices will have different
ranking scores unless they belong to the same cluster consisting of many points that
help to link the distant points, and nearby vertices will have similar ranking scores
unless they belong to different clusters. In the context of our task, there is only one
query in the query set. The resultant ranking score of a sentence in the document is
in proportion to the probability that it is relevant to the query, with a large ranking
score indicating a high probability.


Table 1. Basic Manifold Ranking Algorithm

1. Sort the cosine similarities among vertices in ascending order. Repeatedly
   connect the two vertices with an edge according to this order until a
   connected graph is obtained.

2. Form the affinity matrix $W$ from the cosine similarity between any two
   vertices if there is an edge linking them. Let $W_{ii} = 0$.

3. Symmetrically normalize $W$ by $S = D^{-1/2} W D^{-1/2}$, in which $D$ is the
   diagonal matrix with $(i,i)$-element equal to the sum of the $i$th row of $W$.

4. Iterate $f(t+1) = \alpha S f(t) + (1-\alpha) y$ until convergence, where
   $\alpha$ is a parameter in $[0,1)$, and $y$ is the original labeling.

5. Let $f^*$ denote the limit of the sequence $\{f(t)\}$. Rank each sentence
   according to its ranking score in $f^*$.
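
The steps of Table 1 can be sketched in plain Python as follows. This is a
minimal illustration, not the implementation used in the experiments: it skips
the edge-selection procedure of step 1 and simply keeps every pairwise cosine
similarity, and the toy term vectors, the value alpha = 0.6 and the stopping
tolerance are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two term vectors (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def manifold_rank(vectors, alpha=0.6, tol=1e-10, max_iter=10000):
    """vectors[0] is the query; returns the ranking score of every vertex."""
    n = len(vectors)
    # Step 2 (simplified): affinity matrix with a zero diagonal.
    W = [[0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
         for i in range(n)]
    # Step 3: S = D^{-1/2} W D^{-1/2}, where D holds the row sums of W.
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] and deg[j] else 0.0
          for j in range(n)] for i in range(n)]
    # Step 4: iterate f(t+1) = alpha * S f(t) + (1 - alpha) * y.
    y = [1.0] + [0.0] * (n - 1)
    f = y[:]
    for _ in range(max_iter):
        new_f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
                 for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_f, f)) < tol
        f = new_f
        if converged:
            break
    return f


# Hypothetical vectors over a three-term vocabulary: query, then sentences 1..3.
vectors = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
scores = manifold_rank(vectors)
# Step 5: rank the document sentences (indices 1..n) by decreasing score.
ranking = sorted(range(1, len(vectors)), key=lambda i: scores[i], reverse=True)
```

In this toy graph sentence 3 shares no term with the query, yet it still
receives a positive score because it is linked to the query through sentence 2.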


In the above iterative algorithm, the normalization in the third step is necessary to
prove the algorithm's convergence. During the fourth step, each sentence point
receives the information from its neighbors (first term), and also retains its initial
information (second term). The manifold ranking weight parameter $\alpha$ specifies the
relative contributions to the ranking scores from neighbors and the initial ranking
scores. To avoid self-reinforcement, the diagonal elements of the affinity
matrix are set to zero.

The theorem in [10] guarantees that the sequence $\{f(t)\}$ converges to

$f^* = \beta (I - \alpha S)^{-1} y$    (1)

where $\beta = 1 - \alpha$.
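
For readers who want the intermediate step, the limit in Eq. (1) can be
recovered by unrolling the iteration of Table 1. The short derivation below is
a sketch; it relies on the standard fact that the eigenvalues of the
symmetrically normalized matrix $S$ lie in $[-1, 1]$, so that $(\alpha S)^t$
vanishes as $t$ grows whenever $\alpha \in [0, 1)$.

```latex
\begin{align*}
f(t) &= (\alpha S)^{t} f(0) + (1-\alpha) \sum_{k=0}^{t-1} (\alpha S)^{k}\, y \\
\lim_{t\to\infty} (\alpha S)^{t} &= 0, \qquad
\sum_{k=0}^{\infty} (\alpha S)^{k} = (I - \alpha S)^{-1} \\
f^{*} &= (1-\alpha)\,(I - \alpha S)^{-1} y \;=\; \beta\,(I - \alpha S)^{-1} y
\end{align*}
```

In particular, the limit does not depend on the initial scores $f(0)$.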


3.3  Context-Sensitive  Affinity  Matrix 

A key part of the above manifold ranking algorithm is the affinity matrix W. The
definition of W mainly involves two essential aspects: (1) the pairwise similarity
metric, and (2) the sentence vertex representation.

In previous work, the manifold ranking algorithm applied to text processing only
makes use of the content words of the sentence under consideration. This sentence
representation expresses very limited information about each sentence, and the cosine
similarity calculated on such a representation may not truly reflect the similarity
between the sentences. Table 2 shows a subset of a cluster in DUC 2005, and the
corresponding cosine similarity matrix is shown in Table 3.

From the cosine similarity values shown in Table 3, we can see that sentence 2
is similar to sentence 1 (0.3081). However, from the semantic perspective of the
original document, we think sentence 2 is much more similar to sentence 4 than to
the other sentences, although their cosine similarity is only 0.0292. This problem
may be attributable to the fact that we ignore the contextual information of the
sentences.


Table 2. The First 6 Sentences in a Subset of Cluster d3111 from DUC 2005

SenNo  Text

1      International Company News: VW fails to convince GM over car
       factory 'copying'

2      VOLKSWAGEN has failed to convince General Motors that its
       plans for a revolutionary car plant in Spain are not a copy of a project
       drafted previously by the US group.

3      'We have a right to be sceptical,' Mr David Herman, chairman of
       GM's German subsidiary Adam Opel, said yesterday.

4      'It would be a real tour de force' if Mr Jose Ignacio Lopez de
       Arriortua, GM's former procurement chief who is now at VW, had
       managed to develop a new concept between mid-March, when he left
       the US, and mid-June when he announced VW's plans.

5      Mr Herman was responding to claims in a letter received from VW
       in which Mr Ferdinand Piech, chairman, said the German company
       did not have any confidential plans or documents about GM's
       ultra-low-cost factory project.

6      Mr Herman confirmed that he had written to Mr Piech before Mr
       Lopez's announcement, suggesting that he consider the possible
       consequences if VW's project were the same as the one developed at
       GM under Mr Lopez's direction.


Table 3. Cosine Similarities of Sentences in Table 2

SenNo  1       2       3       4       5       6
1      0       0.3081  0.0499  0.0972  0.1499  0.0745
2      0.3081  0       0.0000  0.0292  0.0749  0.0346
3      0.0499  0.0000  0       0.0533  0.2681  0.2078
4      0.0972  0.0292  0.0533  0       0.1300  0.2601
5      0.1499  0.0749  0.2681  0.1300  0       0.3515
6      0.0745  0.0346  0.2078  0.2600  0.3515  0


In order to improve the performance of summarization, we incorporate the contextual
information into the basic manifold ranking. For this purpose, a sentence point
$x_i \in R^m$ is re-defined in both the original domain, using its original feature
vector $x_i^s \in R^{m_s}$, and in the contextual domain, by introducing
$x_i^c \in R^{m_c}$, an $m_c$-dimensional contextual feature vector representing the
surrounding contextual sentences. We combine the one preceding and one following
sentence of the current sentence into a new pseudo-sentence, and deem this new
pseudo-sentence the contextual information of the current sentence. Then the
contextual information and the original information of the current sentence lead to
two different similarity measures, which can be easily computed and combined. For
example, we can sum the original and the contextual dedicated affinity matrices
(e.g., $W_s$ and $W_c$), or introduce the cross-information between the original and
the contextual features (e.g., $W_{sc}$ and $W_{cs}$) in the formulation.

In the following, we present four different strategies for joint consideration of the
original and the contextual information of sentences in a unified framework for
affinity matrix construction:

•  The  Stacked  Affinity  Matrix 

The most commonly adopted strategy in affinity matrix construction for the manifold
ranking algorithm is to exploit only the information of the original sentence,
$x_i \equiv x_i^s$. However, performance can be improved by including both the
original and the contextual information of the sentences. This is usually done by
means of the “stacked” approach, in which the new feature vectors are built from the
concatenation of the sentence and its context features.

Let us define $x_i$ as the concatenation of the two feature vectors $x_i^s$ and
$x_i^c$. That is, $x_i = \{x_i^s, x_i^c\}$; then the corresponding “stacked” affinity
matrix is:

$W_{stacked}(x_i, x_j) = sim(x_i, x_j)$    (2)

which does not include explicit cross relations between $x_i^s$ and $x_j^c$.
$sim(x_i, x_j)$ is the cosine similarity between the two sentence points $x_i$ and
$x_j$. Table 4 below shows the cosine similarities of the sentences in Table 2 using
the stacked strategy. From the table, we can see that this time sentence 2 is much
more similar to sentence 4 (0.3153, against 0.0292 in Table 3) when the additional
contextual information is involved.

Table 4. Cosine Similarities of Sentences in Table 2 using Stacked Strategy

SenNo  1       2       3       4       5       6
1      0       0.2485  0.4086  0.0718  0.0719  0.0713
2      0.2485  0       0.1288  0.3153  0.1485  0.1591
3      0.4086  0.1288  0       0.1039  0.5830  0.1653
4      0.0718  0.3153  0.1039  0       0.2313  0.5178
5      0.0719  0.1485  0.5830  0.2313  0       0.3041
6      0.0714  0.1591  0.1653  0.5178  0.3041  0

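
The stacked strategy of Eq. (2) can be sketched in Python as follows. The
helper names, the toy term-count vectors and the way the context is built
(adding the term counts of the preceding and following sentence, without
checking document boundaries) are assumptions made for this illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two term vectors (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_vectors(sent_vecs):
    """Context of sentence i: term counts of its preceding plus following sentence."""
    n, m = len(sent_vecs), len(sent_vecs[0])
    ctx = []
    for i in range(n):
        vec = [0.0] * m
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                vec = [a + b for a, b in zip(vec, sent_vecs[j])]
        ctx.append(vec)
    return ctx


def stacked_affinity(sent_vecs, ctx_vecs):
    """Eq. (2): cosine similarity of the concatenated vectors, zero diagonal."""
    stacked = [list(s) + list(c) for s, c in zip(sent_vecs, ctx_vecs)]
    n = len(stacked)
    return [[0.0 if i == j else cosine(stacked[i], stacked[j]) for j in range(n)]
            for i in range(n)]


# Three hypothetical sentences over a two-term vocabulary.
sents = [[1, 0], [0, 1], [1, 1]]
ctx = context_vectors(sents)
W_stacked = stacked_affinity(sents, ctx)
```

Here sentences 0 and 1 share no term, so their plain cosine similarity is zero,
but their stacked similarity is positive because their contexts overlap.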

•  The  Direct  Summation  Affinity  Matrix 

A simple composite affinity matrix combining the original and the contextual
information can be derived from the summation of the original sentence affinity
matrix and the contextual sentence affinity matrix. That is:

$W_{direct}(x_i, x_j) = W_s(x_i^s, x_j^s) + W_c(x_i^c, x_j^c)$
$= sim(x_i^s, x_j^s) + sim(x_i^c, x_j^c)$    (3)

Note that $\dim(x_i^s) = m_s$, $\dim(x_i^c) = m_c$, and
$\dim(W) = \dim(W_s) = \dim(W_c) = n \times n$, where dim denotes the dimension. With
this affinity matrix construction strategy, the relationship between two sentences is
judged according to not only the relationship between the sentences themselves, but
also the relationship between the contexts of the sentences.

•  Weighted  Summation  Affinity  Matrix 

Alternatively, the composite affinity matrix that balances the original and the
contextual information in (3) can be constructed as follows:

$W_{weighted}(x_i, x_j) = \eta \cdot W_s(x_i^s, x_j^s) + (1 - \eta) \cdot W_c(x_i^c, x_j^c)$
$= \eta \cdot sim(x_i^s, x_j^s) + (1 - \eta) \cdot sim(x_i^c, x_j^c)$    (4)

where $\eta$ is a positive real-valued parameter ($0 < \eta < 1$), which constitutes a
tradeoff between the original and the contextual information in forming the sentence
affinity matrix. This composite affinity matrix allows us to extract some information
from the best tuned $\eta$ parameter.

•  The  Cross-information  Affinity  Matrix 

The preceding direct summation matrix can be conveniently modified to account for the cross relationship between the original and the contextual information. That is, it can be expressed as the sum of four positive definite matrices, accounting for the affinity between the two sentences' original content, between their contextual sentences, and the cross-terms between the original and contextual counterparts:

$$W_{cross}(x_i, x_j) = W_s(x_i^s, x_j^s) + W_c(x_i^c, x_j^c) + W_{sc}(x_i^s, x_j^c) + W_{cs}(x_i^c, x_j^s)$$
$$= \mathrm{sim}(x_i^s, x_j^s) + \mathrm{sim}(x_i^c, x_j^c) + \mathrm{sim}(x_i^s, x_j^c) + \mathrm{sim}(x_i^c, x_j^s) \qquad (5)$$

where $W_{sc}(x_i^s, x_j^c)$ is the cosine similarity between $x_i^s$ and $x_j^c$, and $W_{cs}(x_i^c, x_j^s)$ is the cosine similarity between $x_i^c$ and $x_j^s$. The only restriction for this formulation to be valid is that $x_i^s$ and $x_j^c$ need to have the same dimension ($N_c = N_s$). This can be easily achieved, as the dimension of the sentence vector depends on the number of words in the document set, which is a fixed value.

Once we obtain the modified affinity matrix, we can use it to run the manifold ranking algorithm again to improve the sentence ranking results. The overall procedure is the same as described in the ranking algorithm in Table 1.
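
As an illustration only, the direct, weighted and cross-information constructions of (3), (4) and (5) can be sketched with NumPy. The function names, the row-per-sentence input layout, and the assumption that sentence and context vectors share one vocabulary dimension are choices made for this sketch and are not taken from the paper; zeroing the diagonal, as Table 4 suggests, is omitted.

```python
import numpy as np

def cosine_affinity(X, Y):
    """Pairwise cosine similarity between the rows of X and the rows of Y."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return Xn @ Yn.T

def composite_affinities(S, C, eta=0.75):
    """S: sentence vectors (n x m); C: context vectors (n x m), same vocabulary."""
    W_s = cosine_affinity(S, S)    # original content vs. original content
    W_c = cosine_affinity(C, C)    # context vs. context
    W_sc = cosine_affinity(S, C)   # cross term: content of i vs. context of j
    W_cs = cosine_affinity(C, S)   # cross term: context of i vs. content of j
    direct = W_s + W_c                       # equation (3)
    weighted = eta * W_s + (1 - eta) * W_c   # equation (4)
    cross = W_s + W_c + W_sc + W_cs          # equation (5)
    return direct, weighted, cross
```

The default eta = 0.75 mirrors the value reported in the experiments; any of the three returned matrices could then be passed to the ranking procedure in place of the original affinity matrix.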

3.4  Summary  Generation  and  Redundancy  Control 

In multi-document summarization, the number of documents to be summarized can be very large. This makes the information redundancy problem more serious in multi-document summarization than in single-document summarization, and redundancy control becomes a necessary step. Since our focus in this study is the design of effective (sentence) ranking algorithms, we apply a straightforward but effective sentence selection principle. We incrementally add to the summary the highest ranked remaining sentence if it does not significantly repeat the information already included in the summary, until the word limit of the summary is reached.
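
The selection principle can be sketched as a greedy loop. The paper does not state how "significantly repeat" is measured, so the cosine threshold `max_sim` below is a hypothetical stand-in, as are the function names and the choice to skip (rather than truncate) a sentence that would exceed the word limit.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def select_sentences(sentences, scores, vectors, word_limit=250, max_sim=0.5):
    """Greedy selection: visit sentences by descending rank score and keep
    those that fit the word budget and are not too similar to chosen ones."""
    chosen, used_words = [], 0
    for idx in np.argsort(-np.asarray(scores, dtype=float)):
        n_words = len(sentences[idx].split())
        if used_words + n_words > word_limit:
            continue  # assumption: skip sentences that would overflow the limit
        # hypothetical redundancy test: cosine similarity above max_sim
        if any(cosine(vectors[idx], vectors[j]) > max_sim for j in chosen):
            continue
        chosen.append(int(idx))
        used_words += n_words
    return chosen
```

The returned indices are in rank order; a real system might reorder them by document position before producing the summary.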

4  Experiments 

We conduct the experiments on the data sets from DUC 2005 and DUC 2006. In these two years, query-focused multi-document summarization is the only task. According to the task definitions, systems are required to produce a concise summary for each document set, and the length of summaries is limited to 250 English words.

The well-recognized automatic evaluation toolkit ROUGE [11] is used for evaluation. It measures summary quality by counting the overlapping units between system-generated summaries and human-written reference summaries. We report three common ROUGE scores in this paper, namely ROUGE-1, ROUGE-2 and ROUGE-SU4, which are based on unigram match, bigram match, and unigram plus skip-bigram match with a maximum skip distance of 4, respectively. Documents and queries are pre-processed by segmenting sentences and splitting words. Stop words are removed and the remaining words are stemmed using the Porter stemmer.
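
To make the idea of "counting overlapping units" concrete, the following is a simplified n-gram recall in the spirit of ROUGE-N. It is not the official ROUGE toolkit: it uses plain whitespace tokenization, applies no stemming or stop-word removal, pools counts over references, and does not cover ROUGE-SU4. The function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # each reference n-gram can be matched at most as often as it
        # occurs in the candidate summary
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

For the candidate "the cat sat on the mat" and the single reference "the cat was on the mat", this gives a unigram recall of 5/6 and a bigram recall of 3/5.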

4.1  Performance  Evaluation  and  Comparison 

In the experiments, the manifold ranking based summarizer using contextual information is compared with the two baselines employed in the DUC: the lead baseline and the coverage baseline. The lead baseline takes the first sentences one by one from the last document in the collection, where documents are assumed to be ordered chronologically. The coverage baseline takes the first sentence of each document in turn, from the first document to the last document. We also present the results of the three systems with the highest ROUGE scores that participated in DUC 2005 and DUC 2006 for comparison.

For further comparison with the context-sensitive manifold ranking algorithm, we also implement the basic manifold ranking algorithm without any contextual information, as proposed in [1] (listed as 'Wan's' in the tables).

Tables 5 and 6 show the comparison results on the DUC 2005 and 2006 data sets respectively. The parameter of the manifold ranking based approaches is set to $\alpha = 0.6$, and the parameter of the weighted summation affinity matrix is set to $\eta = 0.75$. S15, S24, etc. in the tables are the IDs of the top performing systems that participated in the DUC, and the other rows show the results of the proposed approach with four different affinity matrix construction strategies and of the two baselines. 'Stacked' denotes the use of the stacked affinity matrix, 'Direct' denotes the use of the direct summation affinity matrix, 'Weighted' denotes the use of the weighted summation affinity matrix, and 'Cross' denotes the use of the cross-information affinity matrix.


Table 5. Experimental Results on the Data of DUC 2005

Systems            ROUGE-1   ROUGE-2   ROUGE-SU4
Stacked            0.38592   0.07498   0.13371
Direct             0.38951   0.07501   0.13385
Weighted           0.39005   0.07515   0.13397
Cross              0.39249   0.07520   0.13405
Wan's              0.38523   0.07496   0.13353
S15                0.37665   0.07381   0.13260
S4                 0.37484   0.07003   0.12798
S17                0.36930   0.07256   0.12977
Coverage Baseline  0.34659   0.06013   0.09275
Lead Baseline      0.30583   0.04875   0.08154

Table 6. Experimental Results on the Data of DUC 2006

Systems            ROUGE-1   ROUGE-2   ROUGE-SU4
Stacked            0.41702   0.10284   0.17405
Direct             0.41715   0.10291   0.17419
Weighted           0.41719   0.10295   0.17425
Cross              0.41734   0.10358   0.17430
Wan's              0.41685   0.10279   0.17401
S12                0.41611   0.10276   0.17399
S23                0.41505   0.10800   0.17834
S24                0.41020   0.10727   0.17431
Coverage Baseline  0.36753   0.08132   0.14596
Lead Baseline      0.33574   0.06942   0.12439

From Tables 5 and 6, we can see that on the DUC 2005 data set the proposed approaches outperform all the top systems and the baseline systems on all the ROUGE scores. On the DUC 2006 data set they outperform both baselines on all scores and achieve the highest ROUGE-1, although S23 and S24 obtain higher ROUGE-2 and ROUGE-SU4 scores. When compared with Wan's approach, we can also see that once the contextual information is involved in affinity matrix construction, the enhanced context-sensitive manifold ranking approach achieves improved performance on both the DUC 2005 and the DUC 2006 data sets. This demonstrates the advantage of using contextual information in manifold ranking.

4.2 Influence of Parameter $\eta$ Used in Weighted Summation Affinity Matrix

Recall that in the definition of the weighted summation affinity matrix, the parameter $\eta$ constitutes a tradeoff between the original and contextual information in forming the sentence affinity matrix. Figure 1 illustrates the influence of $\eta$ on summarization based on context-sensitive manifold ranking using the weighted summation affinity matrix. It is observed that when $\eta$ varies from 0 to 0.7, the performance of the proposed approach is always worse than the corresponding performance of the original manifold ranking approach. Performance is better when $\eta$ varies from 0.7 to 1, which demonstrates that the contextual information can help to improve the performance, but relying only on the contextual information while ignoring the original information of the sentences will unavoidably hurt the performance.


Fig. 1. ROUGE-1 vs. $\eta$ (DUC 2005 and DUC 2006)


4.3  Influence  of  Parameter  Tuning 

Figure 2 and Figure 3 below demonstrate the influence of the manifold weight $\alpha$ in the proposed enhanced manifold ranking approach based on the four different affinity matrices. It is observed that small values of $\alpha$ can degrade the summarization performance, while the performance reaches a relatively stable state when $\alpha$ is around 0.6. This suggests that the value of $\alpha$ used in the above experiments is reasonable.

Fig. 2. ROUGE-1 vs. $\alpha$ in DUC 2005

Fig. 3. ROUGE-1 vs. $\alpha$ in DUC 2006 (curves: Stacked, Direct, Weighted, Cross)


5  Conclusion 

In this paper, we propose a context-sensitive manifold ranking approach to multi-document summarization. Our approach takes advantage of both the original and the contextual information of the sentences from the documents. With this approach, the refined affinity matrix can capture more related information. The experimental results show that the proposed approach improves system performance and that the resulting system is comparable to the top performing systems in the DUC.

Abstract. Several key applications like recommender systems deal with data in the form of ratings made by users on items. In such applications, one of the most crucial tasks is to find users that share common interests, or items with similar characteristics. Assessing the similarity between users or items has several valuable uses, among which are the recommendation of new items, the discovery of groups of like-minded individuals, and the automated categorization of items. It has been recognized that popular methods to compute similarities, based on correlation, are not suitable for this task when the rating data is sparse. This paper presents a novel approach, based on the SimRank algorithm, to compute similarity values when ratings are limited. Unlike correlation-based methods, which only consider user ratings for common items, this approach uses all the available ratings, allowing it to compute meaningful similarities. To evaluate the usefulness of this approach, we test it on the problem of predicting the ratings of users for movies and jokes.


1  Introduction 

Many important applications like recommender systems deal with data in the form of ratings made by users on items. In such applications, one of the most crucial tasks is to find users that share common interests, or items with similar characteristics. Assessing the similarity between users or items has several valuable uses, among which are the recommendation of new items, the discovery of groups of like-minded individuals, and the automated categorization of items.

A popular method to compute the similarity between two users, found in many collaborative filtering recommender systems, is based on the correlation between the ratings made by these users on common items. As recognized by several recent works on this topic, such as [5,18], this method is very sensitive to sparse data. For instance, while two users can be similar even if they have rated different items, this method is unable to evaluate their similarity in such cases. Moreover, although recent approaches based on dimensionality reduction and graph theory have been proposed for this problem, they also have their limitations. For example, they cannot be used in situations where there are categorical ratings or other non-numerical rating types, such as the one shown in Figure 1, and do not provide an easy way to integrate prior information on the similarities.

Fig. 1. A bipartite graph representing responses (sets of categorical values) given by users to items

This  paper  presents  a  novel  approach  to  compute  similarities  between  users 
or  items  when  only  a  limited  number  of  ratings  are  available.  Based  on  the 
well-known  algorithm  SimRank  [9],  this  approach  models  the  relations  between 
user  similarities  and  item  similarities  using  a  system  of  linear  equations,  and 
computes  the  similarity  values  by  solving  this  system.  However,  unlike  SimRank 
and  its  recent  extensions,  our  approach  has  the  additional  advantage  of  allowing 
one  to  evaluate  the  agreement  between  any  type  of  ratings,  and  integrate  prior 
similarity information.

The  rest  of  this  paper  is  organized  as  follows.  In  Section  2,  we  present  some  of 
the  most  relevant  work  on  the  topic  and  describe  the  advantages  of  our  approach 
over  these  works.  We  then  present  the  details  of  our  approach  in  Section  3,  and 
illustrate  in  Section  4  its  usefulness  on  the  problem  of  predicting  the  ratings  of 
users  for  movies  and  jokes.  Finally,  Section  5  provides  a  brief  summary  of  our 
work  and  contributions,  and  describes  some  of  its  possible  extensions. 


2  Related  Work 


2.1  Item  Recommendation  and  Sparsity 


Sparsity is a problem occurring frequently in recommender systems when many users have provided ratings for only a limited number of items, or many items have received only a few ratings. A solution proposed for this problem consists in using item content information to enhance the computation of similarities [10,14]. However, reliable content information may not be available, for example, if obtaining this information requires expensive resources (e.g., hand-made annotations) or is simply too difficult (e.g., audio or video data).

Dimensionality  reduction  methods  have  also  been  developed  to  alleviate  the 
problem  of  sparsity.  These  methods  work  by  decomposing  the  user-item  rating 
matrix  [2,17]  or  a  sparse  similarity  matrix  [5,6]  into  a  limited  number  of  latent 
factors.  These  factors,  which  represent  high-level  characteristics  of  users  and 
items,  are  then  used  to  predict  new  ratings.  While  decomposition  approaches 
are  among  the  most  accurate  rating  prediction  methods,  they  generally  lack  the 
ability  to  discover  local  relations  in  the  data.  Moreover,  this  class  of  techniques 
can  only  be  used  with  numerical  ratings,  not  categorical  ones. 

Another category of methods proposed for recommending items in sparse data uses graph theory to model the interactions between users and items and measure the strength of these relations. Such methods include approaches based on geodesic distance [15], diffusion kernels [11], and random walks [5,8,18]. A common problem with these methods is their lack of interpretability and the difficulty of translating ratings into link weights, for instance, if the ratings are negative or non-numerical.

Finally,  a  different  approach,  proposed  in  [3],  computes  item  similarities  by 
solving  a  global  regression  problem  which  finds  the  similarity  values  that  best 
predict  known  ratings  using  an  item-based  nearest-neighbor  formulation.  This 
approach  has  three  main  limitations.  First,  it  relies  on  a  correlation-based  method 
to compute the nearest neighbors, which may be sensitive to sparsity. Also, the
item-based  formulation  used  in  this  approach  only  considers  the  ratings  made 
by  common  users,  which  also  creates  problems  when  the  rating  data  is  sparse. 
Finally,  the  item  similarities  computed  by  this  method  depend  on  the  rating 
that  is  predicted,  which  is  not  suitable  to  the  task  of  finding  general  similarities 
between  all  items. 


2.2  SimRank 


The method introduced in this paper is closely related to the bipartite version of the SimRank algorithm proposed by Jeh and Widom [9]. Let $\mathcal{U}$ and $\mathcal{I}$ be the two sets of nodes of a bipartite graph representing, for instance, the users and items of a recommender system. Moreover, let $\mathcal{I}_u \subset \mathcal{I}$ be the set of items purchased by a given user $u \in \mathcal{U}$, and let $\mathcal{U}_i \subset \mathcal{U}$ be the set of users that have purchased an item $i \in \mathcal{I}$. The similarity between two users $u$ and $v$, $s(u, v)$, is obtained as the average similarity of the items purchased by these users:

$$s(u, v) = \frac{C_1}{|\mathcal{I}_u| |\mathcal{I}_v|} \sum_{i \in \mathcal{I}_u} \sum_{j \in \mathcal{I}_v} s(i, j), \qquad (1)$$

where $C_1 \in [0, 1]$ is a constant controlling the flow of similarity values on the graph links. Likewise, the similarity between two items $i$ and $j$, $s(i, j)$, can be computed as the average similarity of users that have purchased these items:

$$s(i, j) = \frac{C_2}{|\mathcal{U}_i| |\mathcal{U}_j|} \sum_{u \in \mathcal{U}_i} \sum_{v \in \mathcal{U}_j} s(u, v), \qquad (2)$$

$C_2$ having the same role as $C_1$. SimRank computes the similarity values by updating them iteratively using equations (1) and (2), until a fixed point is reached.
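
A compact way to see the fixed-point iteration of (1) and (2) is the matrix sketch below. It assumes a binary user-item purchase matrix as input, and it fixes self-similarities to 1 as in the original SimRank definition; the function name, the default constants and the fixed number of iterations are illustrative choices, not part of this paper.

```python
import numpy as np

def bipartite_simrank(R, c1=0.8, c2=0.8, n_iter=10):
    """R: binary (users x items) matrix with R[u, i] = 1 if u purchased i.
    Returns (user similarity matrix, item similarity matrix)."""
    R = np.asarray(R, dtype=float)
    n_users, n_items = R.shape
    deg_u = np.maximum(R.sum(axis=1), 1.0)   # |I_u|, guarded against 0
    deg_i = np.maximum(R.sum(axis=0), 1.0)   # |U_i|, guarded against 0
    P = R / deg_u[:, None]      # row u averages over the items of user u
    Q = R.T / deg_i[:, None]    # row i averages over the users of item i
    S_u, S_i = np.eye(n_users), np.eye(n_items)
    for _ in range(n_iter):
        new_u = c1 * P @ S_i @ P.T   # equation (1) for all user pairs
        new_i = c2 * Q @ S_u @ Q.T   # equation (2) for all item pairs
        np.fill_diagonal(new_u, 1.0) # s(u, u) = 1 by definition
        np.fill_diagonal(new_i, 1.0) # s(i, i) = 1 by definition
        S_u, S_i = new_u, new_i
    return S_u, S_i
```

Dense matrices are used here for readability only; they grow quadratically with the number of users and items.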

A significant limitation of this approach, in the context of item recommendation, is that it only considers the interactions between users and items (e.g., purchases) but not the ratings. Another method called SimRank++, recently proposed in [1], extends SimRank by taking into account the link weights as modified transition probabilities. In this method, the similarity between two nodes is computed as a weighted average of the similarities of their adjacent nodes:

$$s(u, v) = C_1 \sum_{i \in \mathcal{I}_u} \sum_{j \in \mathcal{I}_v} w_{ui} \cdot w_{vj} \cdot s(i, j), \qquad (3)$$

where $w_{ui}$ is the normalized weight of the link between $u$ and $i$. Like SimRank, this method also has some limitations. First, since link weights are simply multiplied, it may not be possible to compare the agreement between the ratings made by two users on similar items, especially if these ratings are non-numerical. Also, this method does not allow one to integrate prior knowledge on the similarity values, for instance, obtained by comparing the content of items.

2.3  Contributions 

This  paper  makes  the  following  contributions: 

1. It describes a novel approach to compute similarities that extends the SimRank algorithm and its extensions in two important ways:

(a) It uses an arbitrary function to compare the agreement between link weights, which allows the use of non-numerical ratings.

(b)  It  provides  an  elegant  way  to  integrate  prior  information  on  the  similarity 
values  directly  in  the  computations. 

2.  Unlike  similarity  measures  based  on  correlation  which  only  use  the  ratings  on 
common  items,  this  approach  considers  all  the  available  ratings,  allowing  it 
to  compute  similarities  between  users  that  have  rated  different  items,  thereby 
reducing  the  sensitivity  to  sparse  data. 

3.  It  presents  a  first  comprehensive  experimental  evaluation  of  a  SimRank- 
based  method  on  the  problem  of  predicting  new  ratings. 

3  A  Novel  Approach 

3.1  The  General  Formulation 

Consider the task of evaluating the similarity $s(u, v)$ between two users $u$ and $v$. A simple approach, used in several item recommendation systems, is to compute $s(u, v)$ as the correlation between the ratings given by $u$ and $v$ on common items. Besides being limited to numerical ratings, this approach has another significant problem: similarities can only be evaluated for users that have rated common items, and the correlation values are only significant if there is a sufficient number of common items. For these reasons, the correlation approach gives poor results when the rating data is sparse.

As in SimRank, our approach overcomes these limitations by using all the ratings given by $u$ and $v$, not only those given to common items. Thus, we evaluate the similarity between users $u$ and $v$ as the average rating agreement for all pairs of rated items, weighted by the similarity of these items:

$$s(u, v) = \frac{1}{Z_{uv}} \sum_{i \in \mathcal{I}_u} \sum_{j \in \mathcal{I}_v} s(i, j) \, k(r_{ui}, r_{vj}), \qquad (4)$$
where $k$ is a function that evaluates the agreement between two (possibly non-numerical) ratings, and $Z_{uv}$ is a normalization constant, for instance, $Z_{uv} = |\mathcal{I}_u| |\mathcal{I}_v|$. Examples of agreement function $k$ for numerical ratings are the Radial Basis Function (RBF) Gaussian kernel

$$k_{RBF}(r_{ui}, r_{vj}) = \exp\{ -(r_{ui} - r_{vj})^2 / \gamma \}, \qquad (5)$$

where $\gamma$ controls the width of the kernel, and the Correlation kernel

$$k_{Cor}(r_{ui}, r_{vj}) = \frac{(r_{ui} - \bar{r}_u)(r_{vj} - \bar{r}_v)}{\sigma_u \sigma_v}, \qquad (6)$$

$\bar{r}_u$ and $\sigma_u$ being the mean and standard deviation of the ratings given by $u$. Note that $k$ does not need to be semi-definite positive (SDP), and the term kernel is used in a more general way to represent a function measuring similarity.

A benefit of this formulation is that the agreement between two ratings is abstracted in function $k$, which can be tailored to model specific characteristics or constraints of the system, as well as to measure the agreement between any rating types. Moreover, this formulation can be easily extended to include prior information on the similarity between users $u$ and $v$, obtained, for example, by comparing their profiles (gender, age, etc.). Denote by $\hat{s}(u, v)$ the a priori similarity capturing this information. Equation (4) can be extended to include $\hat{s}(u, v)$ as

$$s(u, v) = (1 - \alpha) \, \hat{s}(u, v) + \frac{\alpha}{Z_{uv}} \sum_{i \in \mathcal{I}_u} \sum_{j \in \mathcal{I}_v} s(i, j) \, k(r_{ui}, r_{vj}), \qquad (7)$$

where $\alpha \in [0, 1]$ controls the importance of the a priori similarity in the computation. Likewise, the similarity $s(i, j)$ between two items $i, j \in \mathcal{I}$ can be modeled as

$$s(i, j) = (1 - \alpha) \, \hat{s}(i, j) + \frac{\alpha}{Z_{ij}} \sum_{u \in \mathcal{U}_i} \sum_{v \in \mathcal{U}_j} s(u, v) \, k(r_{ui}, r_{vj}), \qquad (8)$$

where $\hat{s}(i, j)$ models prior knowledge on the similarity between $i$ and $j$, for instance, their content similarity, and $Z_{ij}$ has the same role as $Z_{uv}$.
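
As a minimal sketch of how the right-hand side of (7) could be evaluated for one pair of users, the code below combines an RBF agreement function with a prior similarity. The dictionary layout of `ratings`, the callables `item_sim` and `prior`, and the function names are assumptions made for illustration; the defaults for gamma and alpha follow the values reported later in the experimental section, and the ratings are assumed to be already normalized.

```python
import math

def k_rbf(r1, r2, gamma=0.05):
    """Gaussian RBF agreement between two numerical ratings, as in (5)."""
    return math.exp(-((r1 - r2) ** 2) / gamma)

def user_similarity(ratings, u, v, item_sim, prior, alpha=0.95, kernel=k_rbf):
    """One evaluation of the right-hand side of (7).
    ratings: dict user -> dict item -> rating (hypothetical layout)
    item_sim(i, j): current item similarity s(i, j)
    prior(u, v): a priori user similarity"""
    items_u, items_v = ratings[u], ratings[v]
    z_uv = len(items_u) * len(items_v)   # Z_uv = |I_u| |I_v|
    if z_uv == 0:
        return (1 - alpha) * prior(u, v)  # no ratings: only the prior remains
    agreement = sum(item_sim(i, j) * kernel(r_ui, r_vj)
                    for i, r_ui in items_u.items()
                    for j, r_vj in items_v.items())
    return (1 - alpha) * prior(u, v) + alpha * agreement / z_uv
```

Note that the two users in such a call need not share any item: the agreement term is non-zero whenever the items they rated are themselves similar, which is the property that distinguishes this formulation from correlation on common items.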

3.2  Modeling  Similarities  as  a  Linear  System 

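The relations between similarity values, as defined by equations (7) and (8), form a linear system which can be described using matrix notation. Denote the user and item similarities as vectors $x \in \mathbb{R}^{|\mathcal{U}|^2}$ and $y \in \mathbb{R}^{|\mathcal{I}|^2}$ such that each pair of users $u, v$ is mapped to a unique element $x_{(uv)} = s(u, v)$, and each pair of items $i, j$ maps to a unique element $y_{(ij)} = s(i, j)$. Also, let $c \in \mathbb{R}^{|\mathcal{U}|^2}$ and $d \in \mathbb{R}^{|\mathcal{I}|^2}$ be the vectors of a priori similarities, such that $c_{(uv)} = \hat{s}(u, v)$ and $d_{(ij)} = \hat{s}(i, j)$. Moreover, define $A$ as the $(|\mathcal{U}|^2 \times |\mathcal{I}|^2)$ matrix such that $A_{(uv)(ij)} = \frac{1}{Z_{uv}} k(r_{ui}, r_{vj})$ if $i \in \mathcal{I}_u$ and $j \in \mathcal{I}_v$, and $A_{(uv)(ij)} = 0$ otherwise. Likewise, let $B$ be a $(|\mathcal{I}|^2 \times |\mathcal{U}|^2)$ matrix such that $B_{(ij)(uv)} = \frac{1}{Z_{ij}} k(r_{ui}, r_{vj})$ if $u \in \mathcal{U}_i$ and $v \in \mathcal{U}_j$, and $B_{(ij)(uv)} = 0$ otherwise.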
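
The linear system formed of equations (7) and (8) can thus be written in matrix form as

$$\begin{pmatrix} x \\ y \end{pmatrix} = (1 - \alpha) \begin{pmatrix} c \\ d \end{pmatrix} + \alpha \begin{pmatrix} 0 & A \\ B & 0 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad (9)$$

and has the following solution:

$$\begin{pmatrix} x \\ y \end{pmatrix} = (1 - \alpha) \begin{pmatrix} I & -\alpha A \\ -\alpha B & I \end{pmatrix}^{-1} \begin{pmatrix} c \\ d \end{pmatrix} = (1 - \alpha) \begin{pmatrix} R^{-1} & \alpha A S^{-1} \\ \alpha B R^{-1} & S^{-1} \end{pmatrix} \begin{pmatrix} c \\ d \end{pmatrix}, \qquad (10)$$

where $R = (I - \alpha^2 AB)$ and $S = (I - \alpha^2 BA)$.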

3.3  Computing  the  Similarities 

Although $A$ and $B$ may be very sparse matrices, their large size can make the direct computation of $R^{-1}$ and $S^{-1}$ difficult. A more efficient approach consists in using an iterative method based on the von Neumann series expansion of these matrices [11,13]:

$$R^{-1} = \sum_{n=0}^{\infty} (\alpha^2 AB)^n \quad \text{and} \quad S^{-1} = \sum_{n=0}^{\infty} (\alpha^2 BA)^n.$$

The solution for $x$ can therefore be expressed as

$$x = (1 - \alpha) \left( \sum_{n=0}^{\infty} (\alpha^2 AB)^n c + \alpha A \sum_{n=0}^{\infty} (\alpha^2 BA)^n d \right) = \left( \sum_{n=0}^{\infty} (\alpha^2 AB)^n \right) p, \qquad (11)$$

where $p = (1 - \alpha)(c + \alpha A d)$. Using the same approach, $y$ is obtained as

$$y = \left( \sum_{n=0}^{\infty} (\alpha^2 BA)^n \right) q, \qquad (12)$$

where $q = (1 - \alpha)(\alpha B c + d)$.

This new formulation leads to a simple method to compute $x$ and $y$. Since a similar approach can be used for $y$, we limit our presentation to the computation of $x$. First, the method initializes $x$ to the null vector and initializes a temporary vector $w$ to $p$. Then, the following two steps are repeated until convergence or until a maximum number of iterations is reached:

1. Update the similarities vector: $x \leftarrow x + w$,

2. Update the temporary vector: $w \leftarrow \alpha^2 AB w$.

Theorem 1. Denote by $\lambda_{max}$ the eigenvalue of largest magnitude of matrix $AB$, whose modulus is also known as the spectral radius of $AB$. The iterative method presented above converges if $\alpha^2 |\lambda_{max}| < 1$.


Proof. Let $X \Lambda X^{-1}$ be the eigen-decomposition of matrix $AB$. At the $n$-th iteration, we have

$$\|(\alpha^2 AB)^n\| = \|X (\alpha^2 \Lambda)^n X^{-1}\| \le \|X\| \cdot \|(\alpha^2 \Lambda)^n\| \cdot \|X^{-1}\|.$$

If $\alpha^2 |\lambda_{max}| < 1$ then $\|(\alpha^2 AB)^n\|$ will converge to 0 as $n$ approaches infinity. As a consequence, $x$ will converge to a fixed value.
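
The two-step update for $x$ can be sketched as follows. The stopping test on the norm of $w$, the tolerance, and the function name are assumptions added for this sketch; the matrices are dense here for brevity, whereas sparse storage would be the natural choice for the large, sparse $A$ and $B$ described above.

```python
import numpy as np

def solve_user_similarities(A, B, c, d, alpha=0.95, tol=1e-10, max_iter=100):
    """Accumulate the series x = sum_n (alpha^2 A B)^n p of equation (11)."""
    p = (1 - alpha) * (c + alpha * (A @ d))
    x = np.zeros_like(p)
    w = p.copy()
    for _ in range(max_iter):
        x = x + w                          # step 1: add the current term
        w = alpha ** 2 * (A @ (B @ w))     # step 2: next term; B w first, then A
        if np.linalg.norm(w) < tol:        # assumed stopping rule
            break
    return x
```

On a small system where $\alpha^2 |\lambda_{max}| < 1$ holds, the result can be checked against a direct solve of the block system in (9); when the condition of Theorem 1 fails, the loop simply stops at `max_iter` without having converged.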


To analyze the complexity of this approach, we suppose, as observed in most recommender systems, that the number of ratings given by any user is bounded by a constant $m$ independent of the number of items. Since $A_{(uv)(ij)}$ is non-zero only if $i \in \mathcal{I}_u$ and $j \in \mathcal{I}_v$, assuming an even distribution of ratings among the users and items, the expected number of non-zero values in $A$ is given by

$$\frac{|\mathcal{U}|^2 |\mathcal{I}|^2}{2} \left( \frac{m}{|\mathcal{I}|} \right)^2 = \frac{|\mathcal{U}|^2 m^2}{2} \in O(|\mathcal{U}|^2).$$

Likewise, we find the expected number of non-zero elements of $B$ to be in $O(|\mathcal{U}|^2)$. Moreover, because the method has to store the non-zero values of $A$ and $B$, as well as the values of the possibly dense vectors $x$ and $p$, the expected space complexity of the method is $O(|\mathcal{U}|^2)$. For the time complexity, the dominant operations are the two matrix multiplications: $Bw = w'$ and $Aw'$. Since the complexity of these operations is proportional to the number of non-zero elements in the multiplying matrices, the total expected time complexity of the method is $O(n_{max} |\mathcal{U}|^2)$, where $n_{max}$ is the maximum number of iterations made by the method. While $n_{max}$ largely depends on the normalization constants $Z_{uv}$ and $Z_{ij}$, as well as on the link agreement function $k$, in our experiments the method would normally take 5 to 10 iterations to converge.


3.4  Solving  without  Prior  Information 

Although it is always possible to use default values for $c$ and $d$, for instance $c_{(uv)} = 1$ if $u = v$ and 0 otherwise, the approach proposed in this paper could also be used without such information. The following theorem explains how this can be done.


Theorem 2. Let $G$ be a directed weighted bipartite graph constructed such that each pair of users $u, v$ corresponds to a node $(uv)$ from the first set of nodes, each pair of items $i, j$ is a node $(ij)$ from the second set, and whose adjacency matrix is

$$\mathrm{adj}(G) = \begin{pmatrix} 0 & A \\ B & 0 \end{pmatrix}.$$

If $\alpha = 1$, $A$ and $B$ are non-negative matrices, and $G$ is connected, then vectors $x$ and $y$ correspond, respectively, to the unique eigenvectors of matrices $AB$ and $BA$ associated with the largest eigenvalue of these matrices. Moreover, these eigenvectors can be computed using a power iteration method [7].

Proof. Suppose we constrain $x$ and $y$ to a specific length, for instance $\|x\| = \|y\| = 1$; then equations (7) and (8) can be expressed as $x = \frac{1}{\omega} A y$ and $y = \frac{1}{\sigma} B x$, where $\omega$ and $\sigma$ are normalization constants. Inserting the second one into the first, we get $(\sigma\omega) x = AB x$ and, thus, $x$ is an eigenvector of $AB$ corresponding to the eigenvalue $\lambda = \sigma\omega$. Likewise, $y$ is an eigenvector of $BA$ corresponding to the same eigenvalue.

Furthermore, since $A$ and $B$ are non-negative, so are matrices $AB$ and $BA$. Also, because $G$ is connected, and since $A_{(uv)(ij)} > 0$ if and only if $B_{(ij)(uv)} > 0$, $G$ is also strongly connected. Consequently, the graph with node set $\mathcal{U}^2$ and adjacency matrix $AB$, and the graph with node set $\mathcal{I}^2$ and adjacency matrix $BA$ are also strongly connected. This, in turn, is equivalent to saying that $AB$ and $BA$ are irreducible matrices. Finally, since $AB$ and $BA$ are square, non-negative, irreducible matrices, by the Perron-Frobenius theorem on non-negative matrices, the eigenspace corresponding to the eigenvalue $\lambda_{max}$ of largest magnitude is of dimension one and contains an eigenvector whose components are all positive. Running two parallel power iteration methods on matrices $AB$ and $BA$ will therefore converge to the unique positive eigenvectors of $AB$ and $BA$ associated with $\lambda_{max}$ [7]. The convergence of this method is geometric with respect to $|\lambda'_{max} / \lambda_{max}| < 1$, where $\lambda'_{max}$ is the eigenvalue of second largest magnitude.

Following Theorem 2, the similarity values can be computed by repeating the following two steps until convergence:

1. Update the normalized user similarities: $x \leftarrow Ay / \|Ay\|$,

2. Update the normalized item similarities: $y \leftarrow Bx / \|Bx\|$.

Once  again,  this  approach  usually  converges  within  a  few  iterations  and  the 
complexity  of  each  iteration  is  reduced  by  the  fact  that  matrices  A  and  B  are 
normally  quite  sparse. 
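
The alternating normalized updates can be sketched as below. The uniform starting vector, the stopping test on successive values of $x$, and the function name are choices made for this sketch; it assumes $A$ and $B$ satisfy the non-negativity and connectivity conditions of Theorem 2, so that no intermediate vector is zero.

```python
import numpy as np

def similarities_without_prior(A, B, tol=1e-10, max_iter=1000):
    """Alternate x <- A y / ||A y|| and y <- B x / ||B x|| until x stabilizes."""
    y = np.ones(A.shape[1]) / np.sqrt(A.shape[1])  # assumed uniform start
    x = np.zeros(A.shape[0])
    for _ in range(max_iter):
        x_new = A @ y
        x_new = x_new / np.linalg.norm(x_new)   # step 1: user similarities
        y = B @ x_new
        y = y / np.linalg.norm(y)               # step 2: item similarities
        converged = np.linalg.norm(x_new - x) < tol
        x = x_new
        if converged:
            break
    return x, y
```

At convergence, $x$ and $y$ are unit-length eigenvectors of $AB$ and $BA$ respectively, which can be verified on a small example by comparing $ABx$ with a scalar multiple of $x$.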

4  Experimental  Evaluation 

In  this  section,  we  evaluate  our  approach  on  the  task  of  predicting  the  ratings 
of  users  for  movies  and  jokes.  As  it  is  tailored  to  compute  similarities  in  sparse 
data,  and  not  specifically  to  predict  ratings,  it  should  be  recognized  that  our 
approach  is  not  directly  comparable  with  state-of-the-art  methods  for  this  task. 
Yet, evaluating our approach on this problem still provides valuable information, as it allows us to measure the quality of its computed similarities. To this end, we compare the similarities obtained by our method with those computed with correlation-based and SVD methods, in the nearest-neighbor prediction of
ratings.  Since  all  three  types  of  similarities  use  the  same  approach  to  predict 
ratings,  more  accurate  predictions  indicate  more  relevant  similarity  values. 

4.1  Tested  Methods 

In our experiments we compared three methods to compute similarities. The
first one, called ESR (Enhanced SimRank), is the approach described in this
paper. For these experiments, we used $Z_{uv} = |\mathcal{I}_u||\mathcal{I}_v|$ and $Z_{ij} = |\mathcal{U}_i||\mathcal{U}_j|$ as
normalization constants and the Gaussian RBF kernel of (5) with $\gamma = 0.05$
as the rating agreement function. However, this kernel was used in a slightly
different way for matrices A and B. For A, the kernel was computed on
the normalized ratings $(r_{ui} - \bar{r}_u)/(r_{\max} - r_{\min})$, where $\bar{r}_u$ is the average rating
given by user u and $r_{\min}$, $r_{\max}$ are the minimum and maximum values of the
rating range. For B, however, the kernel was computed on ratings normalized as
$(r_{ui} - \bar{r}_i)/(r_{\max} - r_{\min})$, where $\bar{r}_i$ is the average rating given to item i. Finally, we
used $\alpha = 0.95$ as the blending factor and defined the a priori similarity values as

$$\hat{s}(u,v) \ (\text{resp. } \hat{s}(i,j)) = \begin{cases} 1.0, & \text{if } u = v \ (\text{resp. } i = j), \\ 0.1, & \text{otherwise.} \end{cases}$$

These parameter values were selected based on cross-validation.

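The second method, denoted by PCC, is the Pearson correlation similarity.
Following the literature (e.g., see [16]), we computed user similarities as

$$s(u,v) = \frac{\sum_{i \in \mathcal{I}_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in \mathcal{I}_{uv}} (r_{ui} - \bar{r}_u)^2 \ \sum_{i \in \mathcal{I}_{uv}} (r_{vi} - \bar{r}_v)^2}} \tag{13}$$

and the item similarities as

$$s(i,j) = \frac{\sum_{u \in \mathcal{U}_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in \mathcal{U}_{ij}} (r_{ui} - \bar{r}_i)^2 \ \sum_{u \in \mathcal{U}_{ij}} (r_{uj} - \bar{r}_j)^2}} \tag{14}$$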
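
As a small illustration of the user similarity above, the following sketch computes the Pearson correlation over the items rated by both users. The function name and the dictionary-based data layout are assumptions made for this example, as is the convention of returning 0 when the correlation is undefined (no co-rated items or zero variance).

```python
import math

def pearson_user_similarity(ratings_u, ratings_v, mean_u, mean_v):
    """ratings_u, ratings_v: dict mapping item -> rating for users u and v.
    mean_u, mean_v: average rating of each user."""
    common = set(ratings_u) & set(ratings_v)  # items rated by both users
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    den_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if den_u == 0 or den_v == 0:
        return 0.0  # assumed convention: undefined correlation counts as no similarity
    return num / math.sqrt(den_u * den_v)
```

With sparse data, two users often share no rated item, in which case this similarity is zero and the pair cannot serve as neighbors.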


Finally, the third method, called SVD, is based on the decomposition of the
rating matrix. Like the approach described in [17], we represented each user
u by a vector $p_u \in \mathbb{R}^f$ and each item i by a vector $q_i \in \mathbb{R}^f$, where f is the
dimensionality of the latent space. Vectors $p_u$ and $q_i$ were then learned from the
data by solving the following problem:

$$\min_{p,\,q} \sum_{r_{ui} \in \mathcal{R}} (z_{ui} - p_u^\top q_i)^2 \quad \text{s.t.} \quad \|p_u\| = \|q_i\| = 1, \ \forall u \in \mathcal{U}, \ \forall i \in \mathcal{I}, \tag{15}$$

where $z_{ui} = (r_{ui} - \bar{r}_i)/(r_{\max} - r_{\min})$. This problem corresponds to finding, for
each user u and item i, coordinates on the surface of the f-dimensional unit
sphere such that u will give a high rating to i if their coordinates are close
together on the surface. If two users u and v are nearby on the surface, then
they will give similar ratings to the same items, and, thus, the similarity between
these users can be computed as $s(u,v) = p_u^\top p_v$. Likewise, the similarity between
two items i and j can be obtained as $s(i,j) = q_i^\top q_j$. Based on cross-validation,
we have used f = 50 in our experiments.

The similarities obtained with these three methods were used to predict ratings
$r_{ui}$ in two different ways. In the first approach, called user-based prediction
[12], the K nearest-neighbors of u that have rated i, denoted by $\mathcal{N}_i(u)$, are
found with the user similarities. The ratings of these users for i are then used
to predict $r_{ui}$ as

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in \mathcal{N}_i(u)} s(u,v) \cdot (r_{vi} - \bar{r}_v)}{\sum_{v \in \mathcal{N}_i(u)} |s(u,v)|} \tag{16}$$

The second approach, known as item-based prediction [4], instead uses the item
similarities to find the K nearest-neighbors of item i that have been rated by
u, denoted by $\mathcal{N}_u(i)$, and predicts ratings as

$$\hat{r}_{ui} = \bar{r}_i + \frac{\sum_{j \in \mathcal{N}_u(i)} s(i,j) \cdot (r_{uj} - \bar{r}_j)}{\sum_{j \in \mathcal{N}_u(i)} |s(i,j)|} \tag{17}$$

In the experiments presented in this section, we used K = 50 as the number of
nearest-neighbors considered in the prediction.
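
The user-based prediction can be sketched as below; the item-based variant is symmetric. The function name, the data layout and the `sim` callback are hypothetical, and two details are assumptions of this sketch rather than statements from the text: users with zero similarity are not counted as neighbors, and the user mean is returned when no neighbor is available.

```python
def predict_user_based(u, i, ratings, means, sim, k=50):
    """ratings: dict user -> {item: rating}; means: dict user -> mean rating;
    sim: function (u, v) -> similarity value."""
    # Candidate neighbors: other users who rated item i and have non-zero similarity.
    candidates = [v for v in ratings
                  if v != u and i in ratings[v] and sim(u, v) != 0]
    neighbors = sorted(candidates, key=lambda v: sim(u, v), reverse=True)[:k]
    den = sum(abs(sim(u, v)) for v in neighbors)
    if den == 0:
        return means[u]  # assumed fallback when no neighbor is usable
    num = sum(sim(u, v) * (ratings[v][i] - means[v]) for v in neighbors)
    return means[u] + num / den
```

The number of entries in `neighbors` corresponds to the neighbor count that a sparse similarity matrix tends to keep small.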

4.2  Benchmark  Datasets 

We tested the prediction approaches on three different real-life datasets,
MovieLens¹, Netflix² and Jester³, coming from systems recommending movies and jokes.
The properties of these datasets are given in Table 1. Compared to the other two, the
Jester dataset is particularly dense, with 41,000 ratings per joke on average. This
dataset also differs from the others by the fact that its rating scale is continuous.

Table 1. Properties of the benchmark datasets

Dataset    Type    Nb. users  Nb. items  Nb. ratings  Rating range
MovieLens  Movies  6,040      3,952      1 M          {1, 2, 3, 4, 5}
Netflix    Movies  480,189    17,770     100 M        {1, 2, 3, 4, 5}
Jester     Jokes   72,421     100        4.1 M        [-10, 10]

1 http://www.grouplens.org/

2 http://www.netflixprize.com/

3 http://www.ieor.berkeley.edu/~goldberg/jester-data/

To generate datasets of various sparsity levels, we randomly selected 5,000
users from the Netflix and Jester datasets, and discarded the ratings that were
not made by these users (the ratings of the MovieLens dataset were all kept).
Then, for all three datasets, we sub-sampled the ratings of the remaining users
by randomly selecting a user $u \in \mathcal{U}$ with a probability proportional to $|\mathcal{I}_u|$ and
randomly removing one of its ratings from $\mathcal{I}_u$. We repeated this sub-sampling
process until $|\mathcal{U}| \times \rho_u$ ratings were left, where $\rho_u$ is the desired average number of
ratings per user. To avoid having users with too few ratings, however, we allowed
removing a rating from user u only if $|\mathcal{I}_u| > 0.5 \times \rho_u$. Using an average number
of ratings $\rho_u$ of 5, 10, 15 and 20, we obtained with this approach four subsets
for each of the MovieLens, Netflix and Jester datasets. Note that, although the
MovieLens and Netflix datasets contain information on the users and movies,
as well as timestamps indicating when the ratings were made, we did not take
such information into account in these experiments.
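
The sub-sampling procedure described above can be sketched as follows. The function name, the `seed` argument and the dictionary layout are assumptions made for illustration. Drawing directly among the users that are still allowed to lose a rating is used here in place of drawing among all users and rejecting the ineligible ones; both give the same conditional probabilities.

```python
import random

def subsample_ratings(user_items, rho, seed=0):
    """user_items: dict user -> set of rated items. Returns a sub-sampled copy
    with about len(user_items) * rho ratings in total."""
    rng = random.Random(seed)
    data = {u: set(items) for u, items in user_items.items()}
    target = int(len(data) * rho)
    total = sum(len(items) for items in data.values())
    while total > target:
        # Only users holding more than 0.5 * rho ratings may lose one.
        eligible = [u for u, items in data.items() if len(items) > 0.5 * rho]
        if not eligible:
            break
        # Pick a user with probability proportional to its number of ratings.
        weights = [len(data[u]) for u in eligible]
        u = rng.choices(eligible, weights=weights, k=1)[0]
        data[u].remove(rng.choice(sorted(data[u])))
        total -= 1
    return data
```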

To assess the performance of these strategies, we used a 10-fold cross-validation
scheme, where the dataset $\mathcal{D}$ was randomly split into 10 equal-sized subsets
$\mathcal{D}_k$, k = 1, ..., 10. For each k, we used $\bigcup_{l \neq k} \mathcal{D}_l$ to compute the user and item
similarities (training phase) and then evaluated the Mean Absolute Error (MAE)
and the Root Mean Squared Error (RMSE) on subset $\mathcal{D}_k$. The reported error
values were taken as the mean errors over all 10 subsets.
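
A minimal sketch of this evaluation protocol is given below. The function names and the representation of ratings as (user, item, rating) triples are assumptions; the prediction step itself is left to the caller.

```python
import math
import random

def k_fold_splits(ratings, k=10, seed=0):
    """ratings: list of (user, item, rating) triples.
    Yields (train, test) pairs, one per fold."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[j::k] for j in range(k)]
    for j in range(k):
        train = [r for l, fold in enumerate(folds) if l != j for r in fold]
        yield train, folds[j]

def mae_rmse(pairs):
    """pairs: list of (true_rating, predicted_rating)."""
    errors = [t - p for t, p in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE does.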


MovieLens data subsets

              USER-BASED PREDICTION                       ITEM-BASED PREDICTION
ρu  Result    PCC           SVD           ESR           PCC           SVD           ESR
5   MAE       0.934 (.011)  0.870 (.014)  0.854 (.011)  0.857 (.018)  0.883 (.018)  0.811 (.011)
    RMSE      1.230 (.012)  1.156 (.017)  1.128 (.012)  1.134 (.017)  1.102 (.017)  1.076 (.012)
    #NN       0.7           24.5          24.5          0.0           4.2           4.2
10  MAE       0.897 (.009)  0.798 (.010)  0.783 (.009)  0.860 (.004)  0.832 (.008)  0.751 (.010)
    RMSE      1.170 (.009)  1.000 (.010)  1.030 (.011)  1.133 (.007)  1.090 (.011)  1.005 (.012)
    #NN       8.2           34.1          34.1          3.9           10.4          10.4
15  MAE       0.841 (.008)  0.770 (.006)  0.762 (.010)  0.818 (.008)  0.803 (.007)  0.735 (.008)
    RMSE      1.104 (.010)  1.033 (.009)  1.000 (.010)  1.079 (.010)  1.061 (.007)  0.970 (.010)
    #NN       19.7          38.7          38.7          8.1           15.0          15.6
20  MAE       0.807 (.007)  0.773 (.005)  0.753 (.006)  0.785 (.007)  0.786 (.010)  0.723 (.000)
    RMSE      1.003 (.000)  1.027 (.007)  0.991 (.005)  1.039 (.007)  1.038 (.011)  0.905 (.007)
    #NN       29.3          41.4          41.4          13.3          21.1          21.1

Netflix data subsets

              USER-BASED PREDICTION                       ITEM-BASED PREDICTION
ρu  Result    PCC           SVD           ESR           PCC           SVD           ESR
5   MAE       0.914 (.016)  0.890 (.020)  0.877 (.020)  0.929 (.012)  0.960 (.019)  0.881 (.011)
    RMSE      1.210 (.021)  1.190 (.021)  1.160 (.022)  1.220 (.015)  1.247 (.021)  1.164 (.010)
    #NN       0.5           18.7          18.7          0.4           4.3           4.3
10  MAE       0.890 (.013)  0.845 (.007)  0.811 (.011)  0.920 (.010)  0.894 (.013)  0.819 (.007)
    RMSE      1.170 (.014)  1.117 (.007)  1.081 (.012)  1.213 (.011)  1.171 (.012)  1.086 (.010)
    #NN       5.2           20.9          20.9          2.5           10.0          10.0
15  MAE       0.807 (.008)  0.832 (.011)  0.790 (.011)  0.893 (.010)  0.807 (.008)  0.792 (.011)
    RMSE      1.134 (.011)  1.102 (.011)  1.055 (.011)  1.175 (.011)  1.137 (.010)  1.058 (.013)
    #NN       12.9          31.0          31.0          5.5           15.9          15.9
20  MAE       0.839 (.007)  0.824 (.007)  0.770 (.008)  0.800 (.006)  0.848 (.005)  0.776 (.005)
    RMSE      1.103 (.009)  1.090 (.007)  1.037 (.009)  1.138 (.005)  1.1?? (.008)  1.039 (.000)
    #NN       20.5          33.7          33.7          9.2           21.1          21.4

Jester data subsets

              USER-BASED PREDICTION                       ITEM-BASED PREDICTION
ρu  Result    PCC           SVD           ESR           PCC           SVD           ESR
5   MAE       1.076 (.072)  3.940 (.050)  3.890 (.061)  4.060 (.047)  4.714 (.083)  3.809 (.058)
    RMSE      5.191 (.081)  5.063 (.079)  5.017 (.073)  5.212 (.065)  5.953 (.109)  4.891 (.069)
    #NN       39.5          50.0          50.0          4.1           4.1           4.1
10  MAE       3.710 (.059)  3.675 (.053)  3.655 (.062)  3.588 (.052)  ?.045 (.042)  3.603 (.055)
    RMSE      4.095 (.007)  4.702 (.054)  4.651 (.074)  4.587 (.009)  5.410 (.041)  4.592 (.067)
    #NN       50.0          50.0          50.0          9.1           9.1           9.1
15  MAE       3.005 (.035)  3.571 (.029)  3.581 (.038)  3.476 (.039)  4.193 (.038)  3.539 (.039)
    RMSE      4.017 (.049)  4.567 (.032)  4.538 (.049)  4.434 (.050)  5.184 (.038)  4.493 (.051)
    #NN       50.0          50.0          50.0          13.9          13.9          13.9
20  MAE       3.034 (.018)  3.505 (.019)  3.541 (.015)  3.431 (.020)  4.143 (.031)  3.511 (.018)
    RMSE      4.568 (.027)  4.490 (.024)  4.480 (.025)  4.305 (.028)  5.105 (.032)  4.414 (.027)
    #NN       50.0          50.0          50.0          18.9          18.9          18.9

Fig. 2. Average MAE and RMSE (and corresponding standard deviation) obtained for
the MovieLens, Netflix and Jester data subsets, with an average number of ratings
per user $\rho_u \in \{5, 10, 15, 20\}$. #NN gives the average number of neighbors used in the
predictions.


C. Desrosiers and G. Karypis


4.3  Prediction  Results 

Figure 2 presents the results for the six rating prediction methods on the
MovieLens, Netflix and Jester data subsets. The lower the MAE and RMSE values,
the more accurate the methods are at predicting ratings. Moreover, the #NN
values give the average number of neighbors used in the predictions. A low value
indicates that a significant portion of the user or item similarities are equal to
zero, due to data sparsity.

From these results, we can see that the similarity values obtained by our
method lead to more accurate predictions than those of the SVD method, even
though these predictions were made with the same number of neighbors. Moreover,
compared to PCC, our method also leads to better results on the sparser
datasets MovieLens and Netflix. However, in the denser Jester dataset, PCC
similarities produce more accurate predictions for $\rho_u = 15$ and $\rho_u = 20$. Even
though we have used only a sub-sample of the ratings, one should note that
the Jester data subsets tested in our experiments are still very dense. Thus, for
$\rho_u = 15$, users still have rated on average 15% of the jokes. Nevertheless, the
results of these experiments seem to indicate that our method provides better
similarity values when the data is sparse, but correlation-based approaches might
be superior when a large number of ratings is available.


5  Summary  and  Future  Works 

This paper presented a novel approach to compute similarities. Like SimRank, our
approach uses a formulation that associates similarities between linked objects of
two different sets. However, our approach also allows one to model the agreement
between link weights using any desired function and provides an elegant way to
integrate prior information on the similarity values directly in the computations.

To illustrate its usefulness, we have described how this approach can be used
to evaluate the similarities between the users or the items of a recommender
system, based on the ratings of users on items. In contrast to the traditional
methods using rating correlation, our approach has the benefit of considering
all the available ratings made by two users, making possible the computation of
similarities between users that have rated different items. Also, as opposed to
more recent recommendation methods, this approach is not limited to numerical
ratings and provides a simple way to integrate information on item content or
user profile similarity. Finally, experiments conducted on the problem of predicting
new ratings on three different real-life datasets have shown that the similarities
obtained with our approach lead to more accurate predictions than those
obtained by two other methods, based on Pearson correlation and on SVD, when
the data is sparse.

In future work, we would like to investigate more deeply the impact of using prior
knowledge on the similarities, obtained for instance from user profiles and item
content. Moreover, we also plan to define and evaluate other types of rating
agreement functions, in particular in the setting where ratings are non-numerical.

Abstract. In this paper, we propose a multiobjective optimization
(MOO) based technique to determine the appropriate weight of voting
for each class in each classifier for Named Entity Recognition (NER).
Our underlying assumption is that the reliability of the predictions of each
classifier differs among the various named entity (NE) classes. Thus, it is
necessary to quantify the amount of voting for each class in a particular
classifier. We use Maximum Entropy (ME) as the base to generate a
number of classifiers depending upon the various feature representations.
The proposed algorithm is evaluated for a resource-constrained language
like Bengali and yields overall recall, precision and F-measure values
of 79.98%, 82.24% and 81.10%, respectively. Experiments also show that
the classifier ensemble identified by the proposed multiobjective based
technique outperforms all the individual classifiers, three different
conventional baseline ensembles and an existing single objective optimization
based approach.


1  Introduction 

Named Entity Recognition (NER) is an important pipelined module in many
Natural Language Processing (NLP) application areas such as information
extraction [1], machine translation [2], question answering [3] and automatic
summarization [4]. Named Entity (NE) identification in Indian languages in
general, and Bengali in particular, is more difficult and challenging compared to
English, most of the European languages and some of the Asian languages such
as Chinese, Japanese and Korean. The difficulties include: (i) the lack of
capitalization information, (ii) the appearance of NEs in the
dictionary with some other specific meanings, (iii) the free word order nature of
the languages and (iv) the resource-constrained environment, i.e., non-availability
of corpora, annotated corpora, name dictionaries, good morphological analyzers,
part of speech (POS) taggers etc. in the required measure. Thus, developing
reasonably accurate NE taggers for such resource-poor languages is a big
challenge.

The  first  two  authors  are  the  joint  first  authors. 

In the area of machine learning, the concept of combining (or, ensembling)
classifiers has drawn much attention from researchers during the last few years,
with the aim of achieving better performance than the individual
classifiers, which could be of homogeneous or heterogeneous types. Feature
selection is also a very crucial issue in machine learning. In the present work, we
assume that rather than searching for the best-fitting feature set, ensembling
several homogeneous NER systems, where each one is based on a different feature
representation, could be more effective. But the selection of an appropriate subset
of classifiers is very crucial. Moreover, not all classifiers are good at detecting all
types of NE classes. For ensembling the outputs of all classifiers, either majority
voting or weighted voting is used. In the case of weighted voting, weights should
vary among the various NE classes in each classifier. The weight of a particular
classifier should be high for the NE classes on which it performs well, and
low for the NE classes for which its outputs are
not very reliable. Some single objective optimization techniques like the genetic
algorithm (GA) [5] can be used to determine the appropriate weight combinations
per classifier [6]. Such a single objective optimization technique can only optimize
a single quality measure, e.g., recall, precision or F-measure, at a time. But
sometimes a single measure cannot capture the quality of a good ensemble
reliably. A good weighted vote based ensemble should have all its parameters
optimized simultaneously. In order to achieve this, we use multiobjective
optimization (MOO) [7], which is capable of simultaneously optimizing more than
one classification quality measure. Experimental results also show that MOO
performs better than single objective optimization for NER. We
use the ME framework as a base classifier. Depending on the various combinations
of the available features, different versions of this classifier are made. These
features are language independent in nature, and can be derived for almost all
languages with very little effort.

The proposed technique is evaluated for a resource-constrained language like
Bengali. In terms of native speakers, Bengali is the fifth most popular language in the
world, second in India and the national language of Bangladesh. We manually
annotated approximately 250K wordforms that were randomly selected from a
portion of the Bengali news corpus [8], developed from the archive of a leading
newspaper available on the web. In addition, we also use the IJCNLP-08 NER
on South and South East Asian Languages (NERSSEAL)¹ Shared Task data of
around 100K wordforms. Evaluation results of our proposed method yield recall,
precision and F-measure values of 79.98%, 82.24% and 81.10%, respectively.
Results also show that the classifier ensemble identified by our proposed
technique outperforms all the individual classifiers, three different baseline ensembles
and a single objective optimization based approach [6].

In the literature, there exist some works related to NER that made use
of classifier combination techniques. For example, Florian et al. [9] reported a
system combining four diverse classifiers that exhibited the best performance in
the CoNLL-2003 shared task [10]. For Indian languages, a classifier combination
technique for NER has been reported in Ekbal and Bandyopadhyay [11] for
Bengali. But these two works are based on multiple heterogeneous classifiers,
and used a more complex experimental setup along with domain dependent
resources. In contrast, our system (i) is based only on the ME framework, (ii)
makes use of a small set of features that can be very easily obtained for many
languages and (iii) does not make use of domain dependent resources, but still
achieves state-of-the-art performance.

The  key  contributions  of  our  work  are  listed  below: 

1. A MOO based technique is proposed for selecting the best weights to form a
classifier ensemble². We tried to establish that such an ensemble is capable of
increasing the classification quality by a large margin compared to the conventional
ensemble methods.

2. ME is used as a test classifier due to its low computational overhead. However,
the proposed method will work for any set of classifiers, either homogeneous
or heterogeneous. The proposed technique is very general and its performance
may further improve depending upon the choice and/or the number of classifiers
as well as the use of more complex features.

3. The proposed technique can be replicated for any resource-poor language very
easily due to its language independent nature.

4. The proposed technique is applicable to any type of classification problem
like NER, POS-tagging, question-answering etc. To the best of our knowledge,
the use of MOO to select appropriate weights for voting is a novel contribution.

5. Note that our work proposes a novel way of ensembling the available
classifiers. The performance of existing works that are based on ensemble techniques
(e.g., [11], [9] etc.) can be further improved with our proposed algorithm.

6. Another important motivation of the MOO based technique is to provide the
users with a set of alternative solutions with high recall values, solutions with high
precision values, or solutions with moderate recall and precision values. Depending
upon the nature of the problem or the requirements of the users, appropriate
solutions can be selected.

2  Problem  Formulation 

In this section, we formulate the weighted vote based classifier ensemble problem
under the MOO framework. Let the N available classifiers be denoted
by $C_1, \ldots, C_N$ and $A = \{C_i : i = 1, \ldots, N\}$. Suppose there are M output
classes. The weighted vote based classifier ensemble selection problem is then
stated as follows:

Find the weights of votes V per classifier which will optimize a function F(V).
Here, V is a real array of size N × M. V(i,j) denotes the weight of vote of the
ith classifier for the jth class. More weight is assigned to the classes
for which the classifier is more confident, whereas the output classes for which
the classifier is less confident are given less weight. $V(i,j) \in [0,1]$ denotes the
degree of confidence of the ith classifier for the jth class. These weights are used
while combining the outputs of the classifiers using weighted voting. Here, F is
a classification quality measure of the combined weighted vote based classifier.
A problem like NER has mainly three different kinds of
classification quality measures, namely recall, precision and F-measure. Thus,
$F \in \{\text{recall}, \text{precision}, \text{F-measure}\}$.

2 We use 'classifier ensemble' and 'ensemble classifier' interchangeably.

Multiobjective Formulation. The MOO can be formally stated as follows [7].
Find the vector $x^* = [x_1^*, x_2^*, \ldots, x_n^*]^T$ of decision variables that simultaneously
optimizes the M objective values $\{f_1(x), f_2(x), \ldots, f_M(x)\}$, while satisfying the
constraints, if any.

Now, the weighted vote based classifier ensemble selection problem under the
MOO framework takes the following form:

Find the weights of votes per classifier V such that $[F_1(V), F_2(V)]$ is maximized,
where $F_1, F_2 \in \{\text{recall}, \text{precision}, \text{F-measure}\}$ and $F_1 \neq F_2$. We choose $F_1 =$
recall and $F_2 =$ precision.
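
To make the two objectives concrete, the sketch below combines classifier outputs by weighted voting and then measures recall and precision of the combined output. It is an illustration under simplifying assumptions, not the evaluation used in the paper: scores are computed per token rather than per entity, class index 0 is assumed to stand for "not a named entity", and all function and parameter names are hypothetical.

```python
def weighted_vote(predictions, V):
    """predictions[i]: class index chosen by classifier i.
    V[i][j]: vote weight of classifier i for class j."""
    n_classes = len(V[0])
    scores = [0.0] * n_classes
    for i, label in enumerate(predictions):
        scores[label] += V[i][label]
    return max(range(n_classes), key=lambda j: scores[j])

def objectives(outputs, gold, V, outside=0):
    """outputs[t][i]: class predicted by classifier i for token t.
    Returns (recall, precision), ignoring the `outside` class."""
    ensemble = [weighted_vote(preds, V) for preds in outputs]
    tagged = sum(1 for e in ensemble if e != outside)
    relevant = sum(1 for g in gold if g != outside)
    correct = sum(1 for e, g in zip(ensemble, gold) if e == g and g != outside)
    recall = correct / relevant if relevant else 0.0
    precision = correct / tagged if tagged else 0.0
    return recall, precision
```

Each candidate weight array V thus maps to one (recall, precision) point, which is what the multiobjective search compares.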

Selection of Objectives. The performance of MOO largely depends on the choice of
the objective functions, which should be as contradictory as possible. In this work,
we choose recall and precision as the two objective functions. From the definitions, it
is clear that while recall tries to increase the number of tagged entries as much
as possible, precision tries to increase the number of correctly tagged entries.
These two capture two different classification qualities. Often, there is an inverse
relationship between recall and precision, where it is possible to increase one at
the cost of reducing the other. For example, an information retrieval system (such
as a search engine) can often increase its recall by retrieving more documents, at
the cost of increasing the number of irrelevant documents retrieved (i.e. decreasing
precision). This is the underlying motivation for simultaneously optimizing these
two objectives. Figure 1 shows, for example, the Pareto optimal front identified
by the proposed MOO approach. This again supports the contradictory nature
of these two objective functions.

Note that F-measure is the harmonic mean (i.e., a weighted average) of recall
and precision. But it has been thoroughly discussed in Chapter 2 of Ref. [7]
that the weighted sum approach cannot identify all non-dominated solutions. Only
solutions located on the convex part of the Pareto front can be found. As discussed
in the last note of the introduction, another important motivation of this
work is to provide the user with a set of alternative solutions. Thus, MOO is indeed
the best candidate to solve this problem. Here, no weight is required to combine
the objectives (i.e., recall and precision) and thus no a priori information on
the problem is needed.
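
The notion of a non-dominated solution can be illustrated with a short sketch. The function names are hypothetical and the (recall, precision) pairs in the usage note are made-up values, not results from the paper.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, among the hypothetical pairs (0.80, 0.70), (0.75, 0.82), (0.70, 0.80) and (0.60, 0.90), only (0.70, 0.80) is dominated, since (0.75, 0.82) is better on both objectives. The three remaining pairs are alternative trade-offs between recall and precision.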

Fig. 1. Pareto optimal front of the proposed MOO based ensemble for the combined
classifiers

Nondominated Sorting GA-II. Our main objective is to find the appropriate
weights of voting that will be most suitable to form a classifier ensemble.
In order to achieve this goal, we use a multiobjective evolutionary algorithm,
namely Nondominated Sorting GA-II (NSGA-II). NSGA-II [12] is a widely used
MOO technique based on GA. Here, initially a random parent population $P_0$
is created and the population is sorted based on the partial order defined by
the non-domination relation. This relation yields a sequence of nondominated
fronts. Each solution of the population is assigned a fitness which is equal to
its non-domination level in the partial order. A child population $Q_0$ of size N
is created from the parent population $P_0$ by using binary tournament selection,
recombination, and mutation operators. According to this algorithm, in the tth
iteration, a combined population $R_t = P_t + Q_t$ is formed. The size of $R_t$ is 2N. All
the solutions of $R_t$ are sorted according to non-domination. If the total number
of solutions belonging to the best nondominated set $F_1$ is smaller than N, then
$F_1$ is totally included in $P_{(t+1)}$. The remaining members of the population $P_{(t+1)}$
are chosen from subsequent nondominated fronts in the order of their ranking.
To choose exactly N solutions, the solutions of the last included front are sorted
using the crowded comparison operator [12] and the best among them (i.e. those
with larger crowding distance, lying in less crowded regions) are selected to fill in
the available slots in $P_{(t+1)}$.
The new population $P_{(t+1)}$ is then used for selection, crossover and mutation to
create a population $Q_{(t+1)}$ of size N.
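
The survivor selection step described above can be sketched as follows. This is a simplified illustration with hypothetical function names: fronts are extracted by repeated filtering rather than by the faster bookkeeping of the original NSGA-II procedure, all objectives are assumed to be maximized, and selection, crossover and mutation are omitted.

```python
def dominates(a, b):
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def nondominated_fronts(objs):
    """Return lists of solution indices: best front first."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

def crowding_distance(objs, front):
    """Larger distance means the solution lies in a less crowded region."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # keep extremes
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (hi - lo)
    return dist

def select_next_population(objs, n):
    """Pick n survivors (as indices) from the combined population."""
    chosen = []
    for front in nondominated_fronts(objs):
        if len(chosen) + len(front) <= n:
            chosen.extend(front)
        else:
            dist = crowding_distance(objs, front)
            ranked = sorted(front, key=lambda i: dist[i], reverse=True)
            chosen.extend(ranked[: n - len(chosen)])
            break
    return chosen
```

Here `objs` would hold one (recall, precision) pair per solution of the combined population, and `n` is the population size N.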

3  Named  Entity  Features 

We  use  the  following  features  for  constructing  the  various  classifiers  based  on 
the  ME  framework. 

1.  Context  words:  These  are  the  preceding  and  succeeding  words  of  the  current 
word. 

2. Word suffix and prefix: Fixed length (say, n) word suffixes and prefixes
are very effective to identify NEs and work well for a highly inflective Indian
language like Bengali. Actually, these are the fixed length character sequences
stripped from either the rightmost or leftmost positions of the words.

3. First word: This is a binary valued feature that checks whether the current
token is the first word of the sentence or not. We consider this feature with the
observation that the first word of the sentence is most likely a NE, especially in
a newspaper corpus.

4. Length of the word: This binary valued feature checks whether the length
of the token is less than a predetermined threshold value (set to 5). We observed
that very short words are most probably not NEs.

5. Infrequent word: A list is prepared that contains those words having fewer
than 10 occurrences in the training data. A binary valued feature 'INFRQ' is
defined that fires if the current word appears in this list. We observed that very
frequently occurring words are most probably not NEs.

6. Part of Speech (POS) information: POS information of the current
and/or the surrounding word(s) is extracted using an SVM based POS tagger [13].

7. Position of the word: This binary valued feature checks the position of the
word in the sentence. Sometimes, the position of the word in a sentence acts as a
good indicator for NE identification. This feature fires if the word is at the last
position in the sentence.

8. Digit features: Several digit features (digitComma, digitPercentage etc.)
are defined depending upon the presence and/or the number of digits and/or
symbols in a token. These features are helpful to identify miscellaneous NEs.
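
A possible encoding of most of these features for one token is sketched below. The function name, the feature keys, the boundary markers `<BOS>` and `<EOS>`, and the default affix length and context window are all assumptions made for illustration; the POS feature is omitted because it depends on an external tagger.

```python
def token_features(sentence, k, infrequent_words,
                   affix_len=3, window=1, min_len=5):
    """sentence: list of tokens; k: index of the current token;
    infrequent_words: set of words seen fewer than 10 times in training."""
    word = sentence[k]
    has_digit = any(ch.isdigit() for ch in word)
    feats = {
        "word": word,
        "first_word": k == 0,
        "last_word": k == len(sentence) - 1,
        "short_word": len(word) < min_len,
        "INFRQ": word in infrequent_words,
        "has_digit": has_digit,
        "digit_comma": has_digit and "," in word,
        "digit_percentage": has_digit and "%" in word,
    }
    # Prefixes and suffixes up to the chosen fixed length.
    for n in range(1, affix_len + 1):
        if len(word) >= n:
            feats[f"prefix_{n}"] = word[:n]
            feats[f"suffix_{n}"] = word[-n:]
    # Preceding and succeeding context words.
    for d in range(1, window + 1):
        feats[f"prev_{d}"] = sentence[k - d] if k - d >= 0 else "<BOS>"
        feats[f"next_{d}"] = sentence[k + d] if k + d < len(sentence) else "<EOS>"
    return feats
```

Different classifiers can then be obtained by training on different subsets of these keys, for example by varying `window` or `affix_len`.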

4  Our  Proposed  Method  for  Classifier  Ensemble  Selection 

In this section, we present the classifier ensemble selection problem with a framework
that is founded on the principle of a MOO algorithm, namely NSGA-II.

4.1  Chromosome  Representation  and  Population  Initialization 

If the total number of available classifiers is M and the total number of output tags
(or classes) is O, then the length of the chromosome is M x O (each chromosome
encodes the weights of votes for the O possible classes for each classifier). In the
present work, we use real encoding. The entries of each chromosome are randomly
initialized to a real value (r) between 0 and 1. Here, r = rand()/(RAND_MAX + 1). As an
example, the encoding of a particular chromosome is represented below:

0.59  0.12  0.56  0.09  0.91  0.02  0.76  0.5  0.21 

Here, M = 3 and O = 3 (i.e., a total of 9 votes are possible). The chromosome
represents the following voting ensemble:

The  weights  of  votes  for  3  different  output  classes  are  0.59,  0.12  and  0.56, 
respectively  for  classifier  1;  0.09,  0.91  and  0.02,  respectively  for  classifier  2;  and 
0.76,  0.5  and  0.21,  respectively  for  classifier  3. 

If  the  population  size  is  P  then  all  the  P  number  of  chromosomes  of  this 
population  are  initialized  in  the  above  way. 
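
The encoding above can be illustrated with a minimal Python sketch. The function names (`init_chromosome`, `decode`, `init_population`) and the use of Python's `random` module are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
import random

def init_chromosome(num_classifiers, num_classes, rng):
    """Return a flat list of M * O vote weights drawn uniformly from [0, 1)."""
    return [rng.random() for _ in range(num_classifiers * num_classes)]

def decode(chromosome, num_classifiers, num_classes):
    """Split the flat chromosome into one row of class weights per classifier."""
    return [chromosome[m * num_classes:(m + 1) * num_classes]
            for m in range(num_classifiers)]

def init_population(pop_size, num_classifiers, num_classes, seed=0):
    """Create P chromosomes, all initialized in the same random way."""
    rng = random.Random(seed)
    return [init_chromosome(num_classifiers, num_classes, rng)
            for _ in range(pop_size)]

# The example chromosome from the text, with M = 3 and O = 3.
example = [0.59, 0.12, 0.56, 0.09, 0.91, 0.02, 0.76, 0.5, 0.21]
weights = decode(example, 3, 3)
population = init_population(100, 3, 3)
```

Here `weights[1]` holds the three class weights of classifier 2, i.e. 0.09, 0.91 and 0.02.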

4.2  Fitness  Computation 

Initially, the F-measure values of all the ME based classifiers are computed on a
development set. Then, we execute the following steps to compute the objective
values.

1) Suppose there are a total of M classifiers. Let the overall F-measure
values of these M classifiers be F_i, i = 1 ... M.

2) Each classifier is trained using the training data and tested with the
development data. Now, for the ensemble classifier, the output label for each word in the
development data is determined using the weighted voting of these M classifiers'
outputs. The weight of the class provided by the mth classifier is equal to I(m, i).
Here, I(m, i) is the entry of the chromosome corresponding to the mth classifier and
the ith class. The combined score of a particular class c_i for a particular word w is:

$$f(c_i) = \sum_{\substack{m = 1, \ldots, M \\ op(w, m) = c_i}} I(m, i) \times F_m$$

Here, op(w, m) denotes the output class provided by the mth classifier for the
word w. The class receiving the maximum combined score is selected as the joint
decision.
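
The combined score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: classes are represented by integer indices, and the F-measure values in `f_measures` are made-up numbers, not results from the paper; the weights reuse the example chromosome of Section 4.1.

```python
def combined_scores(predictions, weights, f_measures, num_classes):
    """Sum I(m, i) * F_m over the classifiers m whose output for the word is class i.

    predictions[m] is the class index output by classifier m for one word,
    weights[m][i] is the chromosome entry I(m, i), f_measures[m] is F_m.
    """
    scores = [0.0] * num_classes
    for m, c in enumerate(predictions):
        scores[c] += weights[m][c] * f_measures[m]
    return scores

def ensemble_label(predictions, weights, f_measures, num_classes):
    """Return the class index with the maximum combined score."""
    scores = combined_scores(predictions, weights, f_measures, num_classes)
    return max(range(num_classes), key=scores.__getitem__)

weights = [[0.59, 0.12, 0.56], [0.09, 0.91, 0.02], [0.76, 0.5, 0.21]]
f_measures = [0.77, 0.75, 0.74]   # hypothetical F-measures of the 3 classifiers
predictions = [0, 1, 0]           # classifiers 1 and 3 vote class 0, classifier 2 votes class 1
scores = combined_scores(predictions, weights, f_measures, 3)
label = ensemble_label(predictions, weights, f_measures, 3)
```

In this toy case class 0 collects 0.59 x 0.77 + 0.76 x 0.74 and class 1 collects 0.91 x 0.75, so class 0 is the joint decision.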

3) Now, the overall recall, precision and F-measure values of the ensemble
classifier are computed on the development set. The objective functions
corresponding to a particular chromosome are f_1 = recall and f_2 = precision. The main
goal is to maximize these two objective functions using the search capability of
NSGA-II.

4.3  Genetic  Operators 

We use crowded binary tournament selection as in NSGA-II, followed by
conventional crossover and mutation. The most characteristic part of NSGA-II is
its elitism operation, where the non-dominated solutions [7] among the parent
and child populations are propagated to the next generation. The
near-Pareto-optimal strings of the last generation provide the different solutions to the
ensemble problem.
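
The notion of non-dominance used in the elitism step can be made concrete with a short sketch. It shows only the dominance test and a naive filter for two maximized objectives (recall, precision); it is not NSGA-II itself, which additionally ranks fronts and uses crowding distance. The sample points are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly better in one.

    Both objectives are maximized, as with f_1 = recall and f_2 = precision.
    """
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(points):
    """Keep the points that no other point dominates (quadratic-time filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (recall, precision) pairs of four candidate ensembles.
candidates = [(0.74, 0.85), (0.72, 0.86), (0.71, 0.84), (0.80, 0.82)]
front = non_dominated(candidates)
```

The third candidate is dominated by the first one, so `front` keeps the other three trade-off solutions.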

4.4  Selection  of  a  Solution  from  the  Final  Pareto  Optimal  Front 

In MOO, the algorithms produce a large number of non-dominated solutions [7]
on the final Pareto optimal front. Each of these solutions provides a weighted
vote based classifier ensemble. All the solutions are equally important from the
algorithmic point of view. But sometimes the user may need only a single
solution. Consequently, in this paper a method of selecting a single solution from
the set of solutions is developed.

For every solution on the final Pareto optimal front, the F-measure value of the
weighted vote based classifier ensemble for the development set is calculated. The
best solution is selected to be the one having the highest F-measure value. Final
results on the test data are reported using the classifier ensemble corresponding
to this best solution. There can be many other approaches to selecting
a solution from the final Pareto optimal front.
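
The selection rule can be sketched as follows, assuming the usual balanced F-measure (the harmonic mean of recall and precision). The three entries of `front` are placeholders: the labels are hypothetical, and the third pair of values is invented.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def select_solution(front):
    """Pick the front member with the highest development-set F-measure.

    Each member is a tuple (solution, recall, precision).
    """
    return max(front, key=lambda s: f_measure(s[1], s[2]))

front = [("A", 74.00, 84.82), ("B", 71.68, 86.07), ("C", 80.00, 70.00)]
best = select_solution(front)
```

With these numbers solution "A" is chosen, since its F-measure (about 79.04) is the largest of the three.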

5 Experimental Results and Discussions

We use the OpenNLP Java based ME package (see footnote 3) for the MaxEnt experiments.
Model parameters are computed with 200 iterations without any feature
frequency cutoff. We set the following parameter values for NSGA-II: population
size = 100, number of generations = [?]0, probability of mutation = 0.2 and
probability of crossover = 0.9. Note that these values are selected after a thorough
sensitivity analysis of the parameter values on the performance of the proposed
system. We define three different baseline ensembles as below:

1.  Baseline  1:  This  is  based  on  the  majority  voting  among  the  classifiers. 

2. Baseline 2: This is a weighted voting approach. For each classifier, the weight is
calculated based on the average F-measure value of the 3-fold cross validation
on the training data.

3. Baseline 3: For each classifier, the average F-measure value of each class is
computed from the 3-fold cross validation on the training data. The weight of
any classifier is set to the average F-measure value of the corresponding class
that it assigns to a word.
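
The three baselines differ only in how a vote is weighted, which the following sketch tries to show. The helper names and all F-measure values are hypothetical; ties in the majority vote are broken by first occurrence here, which is an assumption and not something the text specifies.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Baseline 1: the label output by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weight_of):
    """Baselines 2 and 3: weight_of(m, label) is the vote weight of classifier m."""
    scores = defaultdict(float)
    for m, label in enumerate(predictions):
        scores[label] += weight_of(m, label)
    return max(scores, key=scores.get)

predictions = ["PER", "LOC", "LOC"]          # outputs of three classifiers for one word
overall_f = [0.80, 0.35, 0.40]               # hypothetical average F-measure per classifier
class_f = [{"PER": 0.60}, {"LOC": 0.50}, {"LOC": 0.45}]  # hypothetical per-class F-measures

baseline1 = majority_vote(predictions)
baseline2 = weighted_vote(predictions, lambda m, label: overall_f[m])
baseline3 = weighted_vote(predictions, lambda m, label: class_f[m][label])
```

In this toy case Baseline 1 and Baseline 3 output LOC, while Baseline 2 outputs PER because the single PER voter has the highest overall weight.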

5.1  Datasets  for  NER 

Indian languages are resource-constrained in nature. For NER, we use a Bengali
news corpus [8], developed from the archive of a leading Bengali newspaper
available on the web. Out of 34 million word forms, a portion containing approximately
250K wordforms is manually annotated with a coarse-grained NE tagset of four
tags, namely PER (Person name), LOC (Location name), ORG (Organization
name) and MISC (Miscellaneous name). The miscellaneous name includes date,
time, number, percentages, monetary and measurement expressions. The data is
collected mostly from the National, States and Sports domains and the various
sub-domains of District of the particular newspaper. This annotation was carried out
by one of the authors and verified by an expert. We also use the IJCNLP-08 NER
on South and South East Asian Languages (NERSSEAL) (see footnote 4) Shared Task data of
around 100K wordforms that were originally annotated with a fine-grained tagset
of twelve tags. This data is mostly from the agriculture and scientific domains.
An appropriate mapping is defined to convert the fine-grained NE annotated
data to the desired form, i.e., tagged with a coarse-grained tagset of four tags.
In order to report the evaluation results, we randomly partition the dataset into
training, development and test sets that contain approximately 263K, 50K and
37K wordforms, respectively. The proportion of unseen NEs in the test set is 39.5%.
In order to properly denote the boundaries of NEs, the four basic NE tags are
further divided into the format I-TYPE (TYPE → PER/LOC/ORG/MISC), which
means that the word is inside a NE of type TYPE. Only if two NEs of the same
type immediately follow each other, the first word of the second NE will have
the tag B-TYPE to show that it starts a new NE. This is the standard IOB format
that was followed in the CoNLL-2003 shared task [10].
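
The tagging convention just described can be sketched as follows. The input representation (a list of word groups, each with an NE type or `None`) and the `O` tag for words outside any NE are assumptions of this sketch; `O` is the tag used in the CoNLL-2003 data.

```python
def iob_tags(segments):
    """Tag words so that B-TYPE appears only when an NE directly follows
    another NE of the same type; all other NE words get I-TYPE.

    segments is a list of (words, ne_type) pairs; ne_type is None outside NEs
    and every NE segment contains at least one word.
    """
    tags = []
    prev_type = None  # type of the segment immediately before the current one
    for words, ne_type in segments:
        if ne_type is None:
            tags.extend("O" for _ in words)
        else:
            first = "B-" if prev_type == ne_type else "I-"
            tags.append(first + ne_type)
            tags.extend("I-" + ne_type for _ in words[1:])
        prev_type = ne_type
    return tags

segments = [(["Kolkata"], "LOC"), (["Delhi"], "LOC"),
            (["and"], None), (["Sourav", "Ganguly"], "PER")]
tags = iob_tags(segments)
```

Only "Delhi" receives a B- tag here, because it starts a LOC entity right after another LOC entity.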


3 http://maxent.sourceforge.net/

4 http://ltrc.iiit.ac.in/ner-ssea-08

5.2 Results and Discussions

We build a number of different ME models by considering the various
combinations of the available NE features. In particular, we construct the classifiers
from the following set of features:

(i) various context sizes within the previous three and next
three words, (ii) word suffixes and prefixes of length up to three (3+3 different
features) or four (4+4 different features) characters, (iii) POS information of the
current word, (iv) first word, (v) length, (vi) infrequent word, (vii) position,
and (viii) digit features.

We generate 152 different classifiers by considering the various combinations
of the available features. Some of these classifiers are shown in Table 1. The best
individual classifier shows recall, precision and F-measure values of 71.21%,
83.54% and 76.88%, respectively. Thereafter, we apply the single objective GA
based approach [6] to determine the appropriate classifier ensemble. Overall
evaluation results of this ensemble along with the best individual classifier and three
different baseline ensembles are reported in Table 2. Results show that the single
objective GA based ensemble performs better than the best individual
classifier as well as the three baseline ensembles. Then, we apply our proposed MOO
based approach to determine the appropriate ensemble and its results are also
shown in Table 2. We observe increments of 2.16, 1.89, 1.80 and 1.41 percentage
points in F-measure over the best individual classifier, Baseline 1, Baseline 2 and
Baseline 3, respectively. The proposed MOO based approach also attains an
improvement of 0.82 F-measure points over the corresponding single objective version.
Statistical analysis of variance (ANOVA) [14] is performed in order to examine
whether the MOO based ensemble technique really outperforms the best
individual classifier, the three baseline ensembles and the GA based ensemble. ANOVA tests
show that the differences in mean recall, precision and F-measure are statistically
significant, as the p value is less than 0.05 in each of the cases.

Table 1. Evaluation results with the various feature subsets. Here, the following
abbreviations are used: CW: Context words, PS: Size of the prefix, SS: Size of the suffix,
WL: Word length, IW: Infrequent word, PW: Position of the word, FW: First word,
DI: Digit information, -i,j: Context words spanning from the i-th left position to the j-th
right position with the current word at position 0, R: recall, P: precision, F: F-measure,
X: Denotes the presence of the corresponding feature (we report percentages).

                                                    Normal Training      After Sampling
Classifier  CW    FW  PS  SS  WL  IW  PW  DI  POS   R      P      F      R      P      F
M_31        -2,1  X   3   3   X   X   -   X   X     71.21  83.55  76.88  81.52  69.79  75.20
M_42        -2,1  X   3   3   X   X   X   X   X     70.87  83.73  76.74  81.52  69.85  75.24
M_82        -2,1  X   3   4   X   -   -   X   X     68.65  83.54  75.36  79.41  69.38  74.06
M_83        -2,0  X   3   4   X   -   -   X   X     67.54  82.20  74.15  80.18  69.43  74.42
M_85        -1,2  X   3   4   X   -   -   X   X     68.01  82.21  74.44  80.09  68.82  74.03
M_89        -2,2  X   4   3   X   -   -   X   X     65.32  81.89  72.67  78.75  67.77  72.85
M_90        -2,1  X   4   3   X   -   -   X   X     66.24  82.27  73.39  79.21  68.75  73.61
M_92        -1,1  X   4   3   X   -   -   X   X     68.85  82.84  75.20  78.03  67.35  72.30
M_93        -1,2  X   4   3   X   -   -   X   X     66.61  81.67  73.37  79.91  69.06  74.09
M_105       -2,2  X   3   4   X   X   -   X   X     67.40  82.67  74.26  73.79  63.02  67.98
M_106       -2,1  X   3   4   X   X   -   X   X     69.10  83.45  75.60  79.91  68.84  73.96
M_108       -1,1  X   3   4   X   X   -   X   X     69.42  82.63  75.45  79.25  67.16  72.70
M_109       -1,2  X   3   4   X   X   -   X   X     68.28  81.95  74.50  80.45  68.48  73.98
M_110       0,2   X   3   4   X   X   -   X   X     67.88  81.18  73.94  80.05  68.75  73.97
M_112       -3,3  X   3   4   X   X   -   X   X     65.50  81.20  72.51  80.09  65.43  72.02
M_113       -2,2  X   4   3   X   X   -   X   X     65.93  81.76  72.99  79.34  67.26  72.80
M_114       -2,1  X   4   3   X   X   -   X   X     66.72  82.30  73.70  79.48  68.32  73.48
M_115       -1,1  X   4   3   X   X   -   X   X     69.15  82.71  75.32  78.21  67.30  72.34
M_116       -1,2  X   4   3   X   X   -   X   X     66.79  81.68  73.49  79.91  68.48  73.76
M_129       -2,2  X   3   4   X   X   X   X   X     67.36  83.07  74.39  74.67  62.48  68.03
M_130       -2,1  X   3   4   X   X   X   X   X     68.85  83.54  75.49  79.98  68.91  74.03
M_133       -1,2  X   3   4   X   X   X   X   X     68.15  82.16  74.50  80.50  68.52  74.03
M_137       -2,2  X   4   3   X   X   X   X   X     65.68  82.01  72.94  79.25  67.23  72.75
M_140       -1,1  X   4   3   X   X   X   X   X     68.94  82.84  75.25  78.16  67.31  72.33
M_141       -1,2  X   4   3   X   X   X   X   X     66.54  81.90  73.42  79.84  68.51  73.75

Table 2. Results on the test set. Here R, P and F refer to recall, precision and
F-measure, respectively (we report percentages).

                             Normal Training      After Sampling       Mixed
Model                        R      P      F      R      P      F      R      P      F
Best individual classifier   71.21  83.54  76.88  81.52  69.85  75.24  71.21  83.54  76.88
Baseline 1                   71.25  84.12  77.15  81.90  70.21  75.61  71.50  83.98  77.24
Baseline 2                   71.34  84.21  77.24  82.06  70.71  75.96  71.69  84.21  77.45
Baseline 3                   71.43  85.01  77.63  82.54  71.81  76.80  73.12  84.25  78.29
GA based ensemble            71.68  86.07  78.22  83.19  74.25  78.47  78.35  81.38  79.89
MOO based ensemble           74.00  84.82  79.04  82.14  76.39  79.16  79.98  82.24  81.10

Our training set is highly imbalanced. The ratio between positive (NEs) and
negative examples is 1:11.21. We observed on the development set that this
skewed distribution heavily biases the classifiers towards the negative category,
and accordingly investigated random sampling techniques to make the ratio of
positive and negative examples more balanced. We experiment with a sampling
strategy that randomly over-samples the positive examples until their number equals
that of the negative ones. This random sampling yields a new set of 152
classifiers, which are again evaluated on the development set. Results reveal that
in most of the cases, recall values increase at the cost of precision with
respect to the corresponding older versions (constructed with the same set of
features). However, the overall F-measure values are quite similar for most of
the classifiers. Results of some classifiers on the sampled dataset are reported
in Table 1 for the test set. Overall results are presented in Table 2, which shows
that the proposed multiobjective based approach performs better than the best
individual classifier, three baseline ensembles and the single objective GA based
ensemble [6]. Comparison between these two sets of results also shows that the
latter gains recall at the cost of precision in most of the cases. As a result of
sampling, single and multiobjective optimization based techniques attain overall
performance improvements of 0.25 and 0.12 F-measure points, respectively.
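
A minimal sketch of such random over-sampling is given below. Whether the original system duplicated individual tokens or larger units is not stated, so this sketch simply assumes a flat list of training examples and a user-supplied `is_positive` test; both are illustrative choices.

```python
import random

def oversample_positives(examples, is_positive, seed=0):
    """Duplicate randomly chosen positive examples until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [e for e in examples if is_positive(e)]
    negatives = [e for e in examples if not is_positive(e)]
    if not positives or len(positives) >= len(negatives):
        return list(examples)  # nothing to balance
    extra = [rng.choice(positives)
             for _ in range(len(negatives) - len(positives))]
    return list(examples) + extra

# Toy data: 11 non-NE tokens and 1 NE token, roughly the 1:11 ratio of the text.
data = [("w%d" % i, "O") for i in range(11)] + [("Dhaka", "I-LOC")]
balanced = oversample_positives(data, lambda e: e[1] != "O")
```

After the call, `balanced` contains 11 positive and 11 negative examples; the original examples are all retained.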

The basic principle of MOO is that objectives should be as conflicting
as possible in nature. In the first set (normal classifiers), the recall values
are lower than the precision values. But in the second set (sampled classifiers),
recalls are higher than precisions in general. These two observations give an
insight that the capabilities of MOO could be best utilized if it is executed on the
combination of these two types of classifiers. Thus, we select the 76 best classifiers
according to their F-measure values from each of these sets. Thereafter, GA and
MOO based approaches are executed on the resultant 152 classifiers.
Evaluation results are reported in Table 2 for the test set, which again shows that
the MOO based approach performs the best. The proposed MOO based ensemble
technique outperforms the previous two MOO based ensembles by
2.06 (before sampling) and 1.94 (after sampling) F-measure points,
respectively. The single objective GA based ensemble also gains 1.67 and 1.42
F-measure points, respectively. Compared to the baseline models, we observe a slight
degradation of precision in the proposed MOO based ensemble. However, the
Pareto optimal front of Figure 1 reveals that there indeed exist some solutions
with higher precision values.

Summary of Results. Evaluation results reveal that the proposed approach
is able to improve the performance of the classifiers by appropriately
ensembling them. Performance of the ensembles can be further improved if we
combine individual classifiers based on a variety of classification
methodologies that achieve different rates of correctly classified individuals. Moreover,
the MOO based approach provides a set of trade-off solutions from which users can
choose the desired one based on their requirements. We also observe that MOO
outperforms the best individual classifier, the baseline models and a single
objective GA based approach [6].

6 Conclusion

In this paper, we present the problem of selecting the appropriate votes per
classifier for each class in NER as an optimization problem. We have assumed and
experimentally verified that instead of eliminating some classifiers completely, it
is better to quantify the amount of vote per classifier for each class. To solve this
problem, we proposed a MOO based solution that can simultaneously optimize
two different classification measures. Based on the ME framework, a number of
different classifiers have been built by selecting different feature combinations
from a set of language independent features. Our proposed algorithm is
applicable to any language due to its language independent nature. The proposed
algorithm has been evaluated for a resource constrained language, Bengali.
Evaluation results showed that the overall performance attained by the proposed
technique exceeds that of the best individual classifier, three different baseline
ensembles and a single objective optimization based ensemble technique. In future,
we would like to develop the vote based classifier ensembles using other learning
algorithms like Conditional Random Field and Support Vector Machine.

Abstract. The stable marriage problem has a wide variety of practical
applications, ranging from matching resident doctors to hospitals, to
matching students to schools, or more generally to any two-sided market.
We consider a useful variation of the stable marriage problem, where
the men and women express their preferences using a preference list
with ties over a subset of the members of the other sex. Matchings are
permitted only with people who appear in these preference lists. In this
setting, we study the problem of finding a stable matching that marries
as many people as possible. Stability is an envy-free notion: no man
and woman who are not married to each other would both prefer each
other to their partners or to being single. This problem is NP-hard.
We tackle this problem using local search, exploiting properties of the
problem to reduce the size of the neighborhood and to make local moves
efficiently. Experimental results show that this approach is able to solve
large problems, quickly returning stable matchings of large and often
optimal size.


1  Introduction 

The stable marriage problem [1] is a well-known problem of matching men to
women to achieve a certain type of "stability". Each person expresses a strict
preference ordering over the members of the opposite sex. The goal is to match
men to women so that there are no two people of opposite sex who would both
rather be matched with each other than with their current partners. Surprisingly,
such a stable marriage always exists and one can be found in polynomial time.
Gale and Shapley give a quadratic time algorithm to solve this problem based
on a series of proposals of the men to the women (or vice versa) [2]. The stable
marriage problem has a wide variety of practical applications, ranging from
matching resident doctors to hospitals, sailors to ships, primary school students
to secondary schools, as well as in market trading.

There are many variants of this classical formulation of the stable marriage
problem. Some of the most useful in practice include incomplete preference lists
(SMI), which allow us to model unacceptability for certain members of the other
sex, and preference lists with ties (SMT), which model indifference in the
preference ordering. With a SMI problem, we have to find a stable marriage in which
the married people accept each other. It is known that all solutions of a SMI
problem have the same size [3] (that is, number of married people). In SMT
problems, instead, solutions are stable marriages where everybody is married.
Both of these variants can be solved in polynomial time. In real world situations, both ties
and incomplete preference lists may be needed. Unfortunately, when we allow
both, the problem becomes NP-hard [3]. In a SMTI (Stable Marriage with Ties
and Incomplete lists) problem, there may be several stable marriages of different
sizes, and solving the problem means finding a stable marriage of maximum size.

In this paper we investigate the use of a local search approach to tackle this
problem. Our algorithm starts from a randomly chosen marriage and, at each
step, moves to a neighbor marriage which is obtained by removing one blocking
pair, that is, a man-woman pair who are not married to each other in the current
marriage but who prefer to be married with each other rather than with
their current partners. Stable marriages have no blocking pairs, so the aim of
such a move is to pass to a marriage which is closer to stability. Among the
neighbor marriages, the evaluation function chooses one with the smallest
number of blocking pairs and of singles. Since there may be several stable marriages
with different sizes, we look for the one with maximum size (that is, the smallest
number of singles). Random moves are also used, to avoid stagnation in local
minima. The algorithm stops when a perfect matching (that is, a stable
marriage with no singles) is found, or when a given limit on the number of steps is
reached.

This basic local search approach works well with problems of limited size, but
does not scale. With larger sizes, it fails to find good solutions, and sometimes
even stable marriages. One of the main reasons is that the neighborhood can be
very large, since a marriage may have a large number of blocking pairs. Many
such blocking pairs can be ignored since they are "dominated" by others, whose
removal will also eliminate all the dominated blocking pairs. By considering
only undominated blocking pairs, we can solve SMTI problems of much larger
size in a small amount of time. The marriages returned by our local search
method are stable and contain very few single people. Experiments on randomly
generated SMTI problems of size 100 show that our algorithm is able to find
stable marriages with at most two singles on average in tens of seconds at worst.

The SMTI problem has also been tackled in [4], where the problem is
modeled as a constraint optimization problem and a constraint solver is employed
to solve it. This systematic approach is guaranteed to always find an optimal
solution. However, our experimental results show that our local search algorithm
in practice always finds optimal solutions. Moreover, it scales well to sizes much
larger than those considered in [1]. Instances of size comparable to ours are
considered in [5]. However, the problem solved in that paper is the decision version
of our optimization problem. That is, they ask if there exists a stable marriage
of a certain size. Another approach is to use approximation. Given an SMTI
problem, if its maximum cardinality stable marriages are of size k, an
α/β-approximation algorithm is able to return a stable marriage of size at least
β/α · k. The SMTI problem cannot have an l-approximation algorithm for l
smaller than 33/29 unless P = NP [6]. A 3/2-approximation algorithm has been
proposed in [7].

2  Background 

2.1  Stable  Marriage  Problems  with  Ties  and  Incompleteness 

A stable marriage (SM) problem [1] consists of matching members of two
different sets, usually called men and women. When there are n men and n women,
the SM problem is said to have size n. Each person strictly ranks all members
of the opposite sex. The goal is to match the men with the women so that there
are no two people of opposite sex who would both rather marry each other than
their current partners. Such a marriage is called stable. At least one stable
marriage exists for every SM problem. In fact, the set of stable marriages forms a
lattice. Gale and Shapley give a polynomial time algorithm to find the stable
marriage at the top (or bottom) of this lattice [2].
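
For the classical SM problem (complete, strictly ordered lists), the proposal scheme of Gale and Shapley can be sketched as below. This is a textbook-style illustration with men proposing, not code from the paper; people are numbered from 0 and the three-person instance is invented.

```python
def gale_shapley(men_prefs, women_prefs):
    """Men propose in preference order; each woman keeps the best proposer so far.

    men_prefs[m] lists all women from most to least preferred (strict order),
    and women_prefs[w] does the same for men. Returns a dict man -> woman.
    """
    n = len(men_prefs)
    rank = [{m: r for r, m in enumerate(prefs)} for prefs in women_prefs]
    next_choice = [0] * n      # index of the next woman each man will propose to
    fiance = [None] * n        # fiance[w] is the man currently engaged to w
    free = list(range(n))
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if fiance[w] is None:
            fiance[w] = m
        elif rank[w][m] < rank[w][fiance[w]]:
            free.append(fiance[w])   # w leaves her current partner for m
            fiance[w] = m
        else:
            free.append(m)           # w rejects m, who stays free
    return {fiance[w]: w for w in range(n)}

men_prefs = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]
women_prefs = [[1, 0, 2], [0, 2, 1], [0, 1, 2]]
marriage = gale_shapley(men_prefs, women_prefs)
```

Each man proposes to each woman at most once, which gives the quadratic bound on the number of proposals.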

In  this  paper  we  consider  a  variant  of  the  SM  problem  where  preference  lists 
may  include  ties  and  may  be  incomplete.  This  variant  is  denoted  by  SMTI  [8]. 
Ties  express  indifference  in  the  preference  ordering,  while  incompleteness  models 
unacceptability  for  certain  partners. 

Definition 1 (SMTI marriage). Given a SMTI problem with n men and n
women, a marriage M is a one-to-one matching between men and women such
that partners are acceptable for each other. If a man m and a woman w are
matched in M, we write M(m) = w and M(w) = m. If a person p is not
matched in M we say that he/she is single.

Definition 2 (Marriage size). Given a SMTI problem of size n and a
marriage M, its size is the number of men (or women) that are married.

An example of a SMTI problem with four men and four women is shown in Table 1.
A SMTI problem is described by giving, for each man and woman, the
corresponding preference list over members of the other sex. For example, by writing
2 : 2 (3 4) among the men’s preference lists we mean that man m_2 strictly prefers
woman w_2 to women w_3 and w_4, who are equally preferred.


Table 1. An example of a SMTI problem of size 4

men’s preference lists    women’s preference lists
1: 2 1                    1: 3 1 (2 4)
2: 2 (3 4)                2: 1 4 2
3: (1 2 3 4)              3: (1 2) (4 3)
4: (3 2) 1 4              4: (3 2 4)

Definition 3 (Blocking pairs in SMTIs). Consider a SMTI problem P, a
marriage M for P, a man m and a woman w. A pair (m, w) is a blocking pair
in M if m and w find each other acceptable, and m is either single in M or he
strictly prefers w to M(m), and w is either single in M or she strictly prefers
m to M(w).

Definition 4 (Weakly Stable Marriages). Given a SMTI problem P, a
marriage M for P is weakly stable if it has no blocking pairs.

As we will consider only weakly stable marriages, we will simply call them stable
marriages. Given a SMTI problem, there may be several stable marriages of
different size. If the size of a marriage coincides with the size of the problem, it
is said to be a perfect matching.

In the above example, the marriage 2 3 1 4 (where the number in position i
indicates the woman married to man m_i in that marriage) is stable and its size
is 4, so it is a perfect matching.
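
The stability claim for the marriage 2 3 1 4 can be checked mechanically. The sketch below encodes Table 1 as lists of tie groups and enumerates blocking pairs according to Definition 3; the data layout and function names are assumptions made for this illustration.

```python
def ranks(pref_groups):
    """Map each acceptable partner to the index of its tie group (lower is better)."""
    return {p: r for r, group in enumerate(pref_groups) for p in group}

# Table 1, one list of tie groups per person.
MEN = {1: [[2], [1]], 2: [[2], [3, 4]], 3: [[1, 2, 3, 4]], 4: [[3, 2], [1], [4]]}
WOMEN = {1: [[3], [1], [2, 4]], 2: [[1], [4], [2]], 3: [[1, 2], [4, 3]], 4: [[3, 2, 4]]}

def blocking_pairs(men, women, wife):
    """List the blocking pairs of a marriage given as a dict man -> wife.

    Single people are simply absent from the dict; married pairs are
    assumed to be mutually acceptable (Definition 1).
    """
    husband = {w: m for m, w in wife.items()}
    m_rank = {m: ranks(groups) for m, groups in men.items()}
    w_rank = {w: ranks(groups) for w, groups in women.items()}
    pairs = []
    for m in men:
        for w in m_rank[m]:
            if m not in w_rank[w] or wife.get(m) == w:
                continue  # not mutually acceptable, or already married to each other
            m_gains = m not in wife or m_rank[m][w] < m_rank[m][wife[m]]
            w_gains = w not in husband or w_rank[w][m] < w_rank[w][husband[w]]
            if m_gains and w_gains:
                pairs.append((m, w))
    return pairs

example = {1: 2, 2: 3, 3: 1, 4: 4}   # the marriage 2 3 1 4
stable = not blocking_pairs(MEN, WOMEN, example)
```

For the example marriage the list of blocking pairs is empty, whereas the marriage that only pairs m_1 with w_1 is blocked by (m_1, w_2), since w_2 is single and m_1 strictly prefers her.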

Solving  a  SMTI  problem  means  finding  a  stable  marriage  with  maximal  size. 
This  problem  is  NP-hard  [3]. 

2.2  Local  Search 

Local search [9,10] is one of the fundamental paradigms for solving
computationally hard combinatorial problems. Local search methods in many cases represent
the only feasible way for solving large and complex instances. Moreover, they
can naturally be used to solve optimization problems.

Given  a  problem  instance,  the  basic  idea  underlying  local  search  is  to  start 
from  an  initial  search  position  in  the  space  of  all  solutions  (typically  a  ran¬ 
domly  or  heuristically  generated  candidate  solution,  which  may  be  infeasible, 
sub-optimal  or  incomplete),  and  to  improve  iteratively  this  candidate  solution 
by  means  of  typically  minor  modifications.  At  each  search  step  we  move  to  a 
position  selected  from  a  local  neighborhood,  chosen  via  a  heuristic  evaluation 
function.  The  evaluation  function  typically  maps  the  current  candidate  solution 
to  a  real  number  and  it  is  such  that  its  global  minima  correspond  to  solutions 
of  the  given  problem  instance.  The  algorithm  moves  to  the  neighbor  with  the 
smallest  value  of  the  evaluation  function. 

This process is iterated until a termination criterion is satisfied. The termination
criterion is usually that a solution has been found or that a predetermined
number of steps has been reached, although other variants may stop the search
after a predefined amount of time.

Different local search methods vary in the definition of the neighborhood
and of the evaluation function, as well as in the way in which situations are
handled when no improvement is possible. To ensure that the search process does
not stagnate in unsatisfactory candidate solutions, most local search methods
use randomization: at every step, with a certain probability a random move is
performed rather than the usual move to the best neighbor.

3 Local Search on SMTIs

We adapt the classical local search schema to SMTI problems as follows. Given
a SMTI problem P, we start from a randomly generated marriage M for P. At
each search step, we move to a new marriage in the neighborhood of the current
one. For each marriage M, the neighborhood N(M) is the set of all marriages
obtained by removing one blocking pair from M. Consider a blocking pair bp =
(m, w) in M and assume m' = M(w) and w' = M(m). Then, removing bp from
M means obtaining a marriage M' in which m is married with w and both m'
and w' become single, leaving the other pairs in the marriage M unchanged.
Notice that, if M is stable, its neighborhood is empty. Notice also that this
notion of neighborhood is not symmetric.
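
As a minimal illustrative sketch (not the authors' implementation), the move
that removes a blocking pair can be written in Python as follows. The
representation of a marriage as a dictionary from each married man to his wife,
and the name `remove_blocking_pair`, are assumptions made for this example.

```python
def remove_blocking_pair(marriage, m, w):
    """Return the marriage M' obtained from M by removing the blocking pair (m, w).

    `marriage` maps each married man to his wife (an assumed representation);
    people who do not appear in it are single.
    """
    # Drop m's current pair and w's current pair, so that w' = M(m) and
    # m' = M(w) become single. All other pairs are copied unchanged.
    new_marriage = {man: woman for man, woman in marriage.items()
                    if man != m and woman != w}
    new_marriage[m] = w  # m is now married with w
    return new_marriage


M = {"m1": "w1", "m2": "w2", "m3": "w3"}
M_prime = remove_blocking_pair(M, "m1", "w2")
print(M_prime)  # m2 and w1 no longer appear: they are single in M'
```

The function returns a new dictionary rather than modifying M, so the current
marriage can still be compared against each of its neighbors.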

To select the neighbor to move to, we use an evaluation function f : M_n → Z,
where M_n is the set of all possible marriages of size n, and f(M) = nbp(M) +
ns(M). For each marriage M, nbp(M) is the number of blocking pairs in M,
while ns(M) is the number of singles in M which are not in any blocking pair.
The algorithm moves to a marriage M' ∈ N(M) such that f(M') ≤ f(M'')
∀M'' ∈ N(M).
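
The evaluation function can be sketched in Python as below. This is an
illustration under stated assumptions, not the authors' code: preferences are
assumed to be stored as rank dictionaries (lower rank is better, equal ranks
are ties, a missing entry means unacceptable), a marriage is a dictionary from
married men to their wives, and a pair (m, w) is taken to be blocking when both
find each other acceptable and each is either single or strictly prefers the
other to the current partner (the usual notion for weak stability). All
function names are hypothetical.

```python
INF = float("inf")


def strictly_prefers(rank, candidate, partner):
    """True if `candidate` is acceptable and strictly better than `partner`.

    `partner` is None for a single person.
    """
    if candidate not in rank:
        return False
    return partner is None or rank[candidate] < rank.get(partner, INF)


def blocking_pairs(men_rank, women_rank, marriage):
    """All pairs (m, w) that block `marriage` (a dict man -> wife)."""
    husband = {w: m for m, w in marriage.items()}
    return [(m, w)
            for m, rank in men_rank.items()
            for w in rank
            if strictly_prefers(rank, w, marriage.get(m))
            and strictly_prefers(women_rank[w], m, husband.get(w))]


def evaluate(men_rank, women_rank, marriage):
    """f(M) = nbp(M) + ns(M)."""
    pairs = blocking_pairs(men_rank, women_rank, marriage)
    blocking_men = {m for m, _ in pairs}
    blocking_women = {w for _, w in pairs}
    married_women = set(marriage.values())
    # Count only the singles that do not take part in any blocking pair.
    ns = sum(1 for m in men_rank
             if m not in marriage and m not in blocking_men)
    ns += sum(1 for w in women_rank
              if w not in married_women and w not in blocking_women)
    return len(pairs) + ns


men_rank = {"m1": {"w1": 1, "w2": 2}, "m2": {"w1": 1, "w2": 2}}
women_rank = {"w1": {"m1": 2, "m2": 1}, "w2": {"m1": 1, "m2": 2}}
print(evaluate(men_rank, women_rank, {"m1": "w1", "m2": "w2"}))  # (m2, w1) blocks
print(evaluate(men_rank, women_rank, {"m1": "w2", "m2": "w1"}))  # stable and perfect
```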

During  the  search,  the  algorithm  maintains  the  best  marriage  found  so  far, 
defined  as  follows:  if  no  stable  marriage  has  been  found,  then  the  best  marriage 
is  the  one  with  the  smallest  value  of  the  evaluation  function;  otherwise,  it  is  the 
stable marriage with the fewest singles.

To  avoid  stagnation  in  a  local  minimum  of  the  evaluation  function,  at  each 
search  step  we  perform  a  random  walk  with  probability  p  (where  p  is  a  parameter 
of  the  algorithm).  In  the  random  walk,  we  move  to  a  randomly  selected  marriage 
in the neighborhood (we also tried moving to a generic random marriage, but
this gave worse behavior). If a stable marriage is reached, its neighborhood is
empty  and  a  random  restart  is  performed. 

The  algorithm  terminates  if  a  perfect  marriage  (that  is,  a  stable  marriage  with 
no  singles)  is  found,  or  when  a  maximal  number  of  search  steps  is  reached.  Upon 
termination,  the  algorithm  returns  the  best  marriage  found  during  the  search. 

The pseudo-code of our algorithm, called LTI, is shown in Algorithm 1. In
the pseudo-code, M_best is the best marriage found so far, and f_best is its
evaluation (number of blocking pairs plus number of singles). Function best_neighbor
returns one of the best marriages in the neighborhood of the current marriage,
according to the evaluation function.

In addition to this simple local search algorithm, which directly applies standard
local search approaches to SMTI problems, we have also designed a more
sophisticated algorithm which has been tailored to exploit the specific features
of SMTI problems. The main difference is in the definition of the neighborhood,
which refers to the notion of undominated blocking pairs.

Definition 5 (Dominance in blocking pairs). Let (m, w) and (m, w') be two
blocking pairs. Then (m, w) dominates (from the men's point of view) (m, w')
if m prefers w to w'. There is an equivalent concept from the women's point of
view.

Definition 6 (Undominated blocking pair). A men- (resp., women-) undominated
blocking pair is a blocking pair such that there is no other blocking
pair that dominates it from the men's (resp., women's) point of view. When the
point of view (men or women) is clear or not important, we will omit it.


Algorithm 1. LTI

input:  a SMTI problem P, an integer max_steps, a probability p
output: a marriage

M <- random marriage
steps <- 0
M_best <- M
f_best <- f(M)
repeat
    if f(M) = 0 then
        return M
    if rand() < p then
        M <- RandomWalk(M)
    else
        PAIRS <- blocking pairs in M
        if PAIRS is empty then
            perform a random restart
        else
            M <- best_neighbor(M, PAIRS)
    if M is the first stable marriage found so far then
        f_best <- f(M), M_best <- M
    if M_best is not stable and f_best > f(M) then
        f_best <- f(M), M_best <- M
    if both M_best and M are stable, and f_best > f(M) then
        f_best <- f(M), M_best <- M
    steps <- steps + 1
until steps > max_steps
return M_best
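
For readers who want to experiment, the following Python sketch follows the
structure of Algorithm 1. It is an illustration only, not the authors' code.
The data representation (rank dictionaries, a marriage as a dictionary from
married men to wives), the way a random marriage is drawn, and the choice to
restart whenever the neighborhood is empty (also in the random walk branch)
are all assumptions of this sketch, as are the function names.

```python
import random

INF = float("inf")


def blocking_pairs(men, women, M):
    """Blocking pairs of marriage M (dict man -> wife) under weak stability."""
    husband = {w: m for m, w in M.items()}

    def better(rank, new, old):  # `new` is strictly preferred to partner `old`
        return new in rank and (old is None or rank[new] < rank.get(old, INF))

    return [(m, w) for m in men for w in men[m]
            if better(men[m], w, M.get(m)) and better(women[w], m, husband.get(w))]


def f(men, women, M):
    """Evaluation: blocking pairs plus singles that are in no blocking pair."""
    pairs = blocking_pairs(men, women, M)
    busy_men = {m for m, _ in pairs} | set(M)
    busy_women = {w for _, w in pairs} | set(M.values())
    return (len(pairs) + sum(m not in busy_men for m in men)
            + sum(w not in busy_women for w in women))


def random_marriage(men, women, rng):
    """Match men, in random order, to random free mutually acceptable women."""
    M, taken = {}, set()
    for m in rng.sample(list(men), len(men)):
        free = [w for w in men[m] if m in women[w] and w not in taken]
        if free:
            M[m] = rng.choice(free)
            taken.add(M[m])
    return M


def remove_pair(M, m, w):
    """Marry m and w; their previous partners become single."""
    new = {a: b for a, b in M.items() if a != m and b != w}
    new[m] = w
    return new


def lti(men, women, max_steps, p, seed=None):
    rng = random.Random(seed)
    M = random_marriage(men, women, rng)
    # Stable marriages beat unstable ones; ties are broken by smaller f.
    best, best_key = M, (not blocking_pairs(men, women, M), -f(men, women, M))
    for _ in range(max_steps):
        if f(men, women, M) == 0:
            return M                                  # perfect matching
        pairs = blocking_pairs(men, women, M)
        if not pairs:
            M = random_marriage(men, women, rng)      # random restart
        elif rng.random() < p:
            M = remove_pair(M, *rng.choice(pairs))    # random walk
        else:
            M = min((remove_pair(M, m, w) for m, w in pairs),
                    key=lambda X: f(men, women, X))   # best neighbor
        key = (not blocking_pairs(men, women, M), -f(men, women, M))
        if key > best_key:
            best, best_key = M, key
    return best


men = {"m1": {"w1": 1, "w2": 2, "w3": 3}, "m2": {"w1": 1, "w3": 1}, "m3": {"w2": 1}}
women = {"w1": {"m1": 1, "m2": 1}, "w2": {"m1": 2, "m3": 1}, "w3": {"m1": 1, "m2": 2}}
result = lti(men, women, max_steps=200, p=0.2, seed=1)
print(result, f(men, women, result))
```

The sketch recomputes the blocking pairs from scratch at every step, which is
simple but not efficient; an incremental update would be preferable for
problems of size 100.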


For example, consider the SMTI problem in Table 1, the marriage 1 2 3 4, and
two blocking pairs (m_1, w_2) and (m_4, w_2). Using the definitions above, (m_1, w_2)
dominates (m_4, w_2) from the women's point of view. If we remove (m_4, w_2) from
the marriage, (m_1, w_2) will remain. On the other hand, removing (m_1, w_2) also
eliminates (m_4, w_2). Thus removing undominated blocking pairs may reduce the
number of blocking pairs more than eliminating dominated pairs.

We call LTIU the algorithm LTI where the neighborhood is defined as the set
of marriages obtained from the current one by removing any undominated blocking
pair. More precisely, at each step we consider the blocking pairs that are
undominated from the men's point of view and are also undominated from the
women's point of view. Notice that, in this step, the role of men and women
matters, and will yield a different result if swapped.

Then, to ensure gender neutrality in our algorithm¹, in the next step we swap
genders and do the same.
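
One possible reading of this selection step, sketched in Python under stated
assumptions: the blocking pairs are first filtered from one gender's point of
view, and the surviving pairs are then filtered from the other gender's point
of view, which is why the order matters. The rank dictionaries (lower rank is
better) and the names `undominated` and `ltiu_candidates` are hypothetical and
not taken from the paper.

```python
def undominated(pairs, rank, side):
    """Keep the pairs that no other pair in `pairs` dominates.

    side=0: men's point of view, rank[m][w] is the rank m gives to w.
    side=1: women's point of view, rank[w][m] is the rank w gives to m.
    """
    kept = []
    for pair in pairs:
        owner, partner = pair[side], pair[1 - side]
        dominated = any(other[side] == owner
                        and rank[owner][other[1 - side]] < rank[owner][partner]
                        for other in pairs)
        if not dominated:
            kept.append(pair)
    return kept


def ltiu_candidates(pairs, men_rank, women_rank, men_first=True):
    """Blocking pairs undominated for one gender and then for the other."""
    if men_first:
        return undominated(undominated(pairs, men_rank, 0), women_rank, 1)
    return undominated(undominated(pairs, women_rank, 1), men_rank, 0)


pairs = [("m1", "w1"), ("m1", "w2"), ("m2", "w2")]
men_rank = {"m1": {"w1": 2, "w2": 1}, "m2": {"w2": 1}}
women_rank = {"w1": {"m1": 1}, "w2": {"m1": 2, "m2": 1}}
print(ltiu_candidates(pairs, men_rank, women_rank, men_first=True))
print(ltiu_candidates(pairs, men_rank, women_rank, men_first=False))
```

In this small example the two orders select different sets of pairs, so
alternating `men_first` between consecutive steps is one way to realize the
gender swap described above.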

Due to their ability to restart, our algorithms have the PAC (probabilistically
approximately complete) property [11]. That is, as their runtime goes to infinity,
the probability that the algorithm returns an optimal solution goes to one. If
the algorithm starts at a stable marriage, it will perform a random
restart, which will end up in an optimal solution with probability greater than
zero. On the other hand, if the algorithm starts from a non-stable marriage, we
perform one or more steps in which we remove a blocking pair. Such sequences
of blocking pair removals have been shown to converge to a stable marriage with
non-zero probability in the context of SMs with incomplete preference lists [12].
The proof of this result can be adapted to our context, where we also have ties in
the preference lists. Since a stable marriage can be reached with non-zero probability,
and as we have argued above that from any stable marriage random restarting
will reach an optimal solution with non-zero probability, the PAC property holds.


4  Experimental  Setting 

Problems are generated using the same method as in [4]. The generator takes
three parameters: the problem's size n, the probability of incompleteness p1 and
the probability of ties p2. Given a triple (n, p1, p2), a SMTI problem with n men
and n women is generated, as follows:

1. For each man and woman, we generate a random preference list of size n,
i.e., a permutation of n persons;

2. We then iterate over each man's preference list: for a man m_i and for each
woman w_j in his preference list, with probability p1 we delete w_j from m_i's
preference list and delete m_i from w_j's preference list. In this way we get a
possibly incomplete preference list.

3. If any man or woman has an empty preference list, we discard the problem
and go to step 1.

4. We iterate over each person's (men and women's) preference list as follows:
for a man m_i and for each woman in his preference list, in position j ≥ 2,
with probability p2 we set the preference for that woman as the preference
for the woman in position j - 1 (thus putting the two women in a tie).
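
The four steps above can be sketched in Python as follows. This is a hedged
illustration rather than the generator actually used in the experiments: the
function name `generate_smti`, the `seed` parameter, and the output format
(rank dictionaries in which equal ranks denote ties and missing entries denote
unacceptable partners) are assumptions made for this example.

```python
import random


def generate_smti(n, p1, p2, seed=None):
    """Generate a random SMTI instance with n men and n women."""
    rng = random.Random(seed)
    while True:
        # Step 1: random complete preference lists (permutations of n persons).
        men = [rng.sample(range(n), n) for _ in range(n)]
        women = [rng.sample(range(n), n) for _ in range(n)]
        # Step 2: with probability p1, delete a pair from both lists.
        for m in range(n):
            for w in list(men[m]):
                if rng.random() < p1:
                    men[m].remove(w)
                    women[w].remove(m)
        # Step 3: start again if somebody has an empty preference list.
        # Note: for p1 close to 1 this loop may take very long.
        if all(men) and all(women):
            break

    def with_ties(ordered):
        # Step 4: from position 2 on, with probability p2 a person receives
        # the same rank as the person in the previous position (a tie).
        ranks, rank = {}, 0
        for position, person in enumerate(ordered, start=1):
            if position == 1 or rng.random() >= p2:
                rank = position
            ranks[person] = rank
        return ranks

    men_rank = {m: with_ties(lst) for m, lst in enumerate(men)}
    women_rank = {w: with_ties(lst) for w, lst in enumerate(women)}
    return men_rank, women_rank


men_rank, women_rank = generate_smti(5, p1=0.3, p2=0.5, seed=2)
print(men_rank[0])
```

Because deletions in step 2 are applied to both lists at once, acceptability is
symmetric by construction.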

Note that this method generates SMTI problems in which the acceptance is
symmetric. In fact, if a woman w is not acceptable for a man m, m is removed
from w's preference list. This does not introduce any loss of generality because,
even if such a removal is not performed, m and w cannot be matched together
in any stable marriage.


¹ Gender neutrality is usually considered a desirable feature in a stable marriage
procedure.

Notice also that this generator will not construct a SMTI problem in which
a man (resp., woman) has in his preference list only women (resp., men) who
do not find him (resp., her) acceptable. Such a man (resp., woman) will remain
single in every stable matching. Therefore a simple preprocessing step can remove
such men and women, giving a smaller problem of the form constructed by our
generator.

We generated random SMTI problems of size 100, by letting p2 vary in [0, 1.0]
with step 0.1, and p1 vary in [0.1, 0.8] with step 0.1 (above 0.8 the preference lists
start to be empty). For each parameter combination, we generated 100 problem
instances. Moreover, the probability of the random walk is set to p = 20% and
the search step limit is s = 50000.

4.1  Experimental  Results 

We ran our experiments on a machine with 2 x Quad-Core AMD Opteron 2.3GHz
CPUs and 2GB of RAM. In practice we used only one core because our algorithm
is not designed for multithreading.

We first analyzed the behavior of the base algorithm, LTI. Unfortunately,
this algorithm fails to find a stable marriage in most of our test problems (see
Figure 1). In fact, LTI always finds a stable marriage only for problems where there
are many ties (that is, p2 high) and/or a lot of incompleteness (that is, p1 high).



Fig.  1.  Average  number  of  stable  marriages  found  by  LTI 


On the other hand, algorithm LTIU finds a stable marriage in 100% of the
runs. Since stability is essential in our context, from now on we will only show
the experimental results for algorithm LTIU.

We start by showing the average size of the marriages returned by LTIU.
In Figure 2 we can see that LTIU finds a perfect marriage (that is, a stable
marriage with no singles) almost always. Even in settings with a large amount of
incompleteness (that is, p1 = 0.7-0.8) the algorithm finds very large marriages,
with only 2 singles on average.


Fig. 2. Average size of marriages with LTIU

We also consider the number of steps needed by our algorithm. From Figure 3(a),
we can see that the number of steps is less than 2000 most of the time,
except for problems with a large amount of incompleteness (i.e. p1 = 0.8). As
expected, with p1 greater than 0.6, the algorithm requires more steps. In some
cases, it reaches the step limit of 50000. Moreover, as the percentage of ties rises,
stability becomes easier to achieve and thus the number of steps tends to decrease
slightly. We note that complete indifference (i.e. p2 = 1) is a special case.
In fact, in this situation, the number of steps increases for almost every value of
p1. This is because the algorithm makes most of its progress via random restarts.
In these problems every person in a preference list is equally preferred to all the
others. This means that the only blocking pairs are those involving singles who
find each other acceptable. In this situation, after a few steps all singles
that can be married are matched, stability is reached, and the neighborhood
becomes empty. The algorithm therefore performs another random restart. It is
therefore very difficult to reach a perfect matching and the algorithm often runs
until the step limit.

The algorithm takes, on average, less than 40 seconds to give a result even
for problems with a lot of incompleteness (see Figure 3(b)). As expected, with
p2 = 1 the time increases for the same reason discussed above concerning the
number of steps.

Reconsidering Figure 2 and the fact that all the marriages the algorithm finds
are stable, we notice that most of the marriages are perfect.

From Figure 4 we see that the average percentage of matchings that are
perfect is almost always 100% and this percentage only decreases when the
incompleteness is large.

(a) average number of steps    (b) average execution time

Fig. 3. Average number of steps and execution time for LTIU


Fig. 4. Percentage of perfect matchings


We compared our local search approach to the complete method from [4]. In
their experiments, they measured the maximum size of the stable marriages in
problems of size 10, fixing p1 to 0.5 and varying p2 in [0,1]. We did the same
experiments (generating new instances), and obtained stable marriages of a very
similar size to those reported in [4]. This means that although our algorithm
is incomplete in principle, it always finds an optimal solution in our randomly
generated instances, and for small sizes it behaves as a complete algorithm in
terms of size of the returned marriage. However, we can also tackle problems
of much larger sizes (at least 100), still obtaining optimal solutions most of the
time.

We also considered the runtime behavior of our algorithm. In Figure 5 we show
the average normalized number of blocking pairs and, in Figure 6, the average
normalized number of singles of the best marriage as the execution proceeds.
Although the step limit is 50000, we only plot results for the first steps because the
rest is a long plateau that is not very interesting. We show the results only for
p2 = 0.5. However, for a greater (resp., lower) number of ties the curves are shifted
slightly down (resp., up). From Figure 5 we can see that the average number of
blocking pairs decreases very fast, reaching 5 blocking pairs after only 100 steps.
Then, after 300-400 steps, we reach 0 blocking pairs (i.e., a stable marriage)
almost all the time for all values of p1. Considering Figure 6, we can see that
the algorithm starts with more singles for greater values of p1. This happens
because, with more incompleteness, it is less probable for a person to be
acceptable. However, after 200 steps, the average number of singles becomes very
small no matter the incompleteness in the problem. Looking at both Figures 5
and 6, we observe that, although we set a step limit s = 50000, the algorithm
reaches a very good solution after just 300-400 steps. In fact, after this number
of steps, the best marriage found by the algorithm usually has no blocking pairs
nor singles, i.e., it is a perfect matching. This appears largely independent of the
amount of incompleteness and the number of ties in the problems. Hence, for
SMTI problems of size 100 we could set the step limit to just 400 steps and still
be reasonably sure that the algorithm will return a stable marriage with a large
size, no matter the amount of incompleteness and ties.


Fig. 5. Average normalized number of blocking pairs (p2=0.5)


Fig. 6. Average normalized number of singles (p2=0.5)

5  Conclusions 

We have presented a local search approach for solving stable marriage problems
with ties and indifference. Experimental results show that our algorithm is both
fast and effective at finding large stable marriages. Moreover, the runtime behavior
of the algorithms is not greatly influenced by the amount of incompleteness
or ties in the problem. The algorithm was usually able to obtain a very good
solution after a very small amount of time.

Future directions include an assessment of the trade-off between the cost of
finding the undominated blocking pairs and that of treating larger neighborhoods.
We also plan to apply a local search approach to other versions of the
SMTI problem and to study other variants of our algorithm, for example including
tabu search or other greedy heuristics.


Abstract. Conventional methods for multimodal data retrieval use text-tag
based or cross-modal approaches such as tag-image co-occurrence and canonical
correlation analysis. Since there are differences of granularity in text and
image features, however, approaches based on lower-order relationships between
modalities may have limitations. Here, we propose a novel text and image keyword
generation method by cross-modal associative learning and inference with
multimodal queries. We use a modified hypernetwork model, i.e. layered
hypernetworks (LHNs), which consist of a first (lower) layer and a second
(upper) layer that have more than two modality-dependent hypernetworks and
one modality-integrating hypernetwork, respectively. LHNs learn higher-order
associative relationships between text and image modalities by training on an
example set. After training, LHNs are used to extend multimodal queries by
generating text and image keywords via cross-modal inference, i.e. text-to-image
and image-to-text. The LHNs are evaluated on Korean magazine articles
with images on women's fashion and life-style. Experimental results show that
the proposed method generates vision-language cross-modal keywords with
high accuracy. The results also show that multimodal queries improve the accuracy
of keyword generation compared with uni-modal ones.

Keywords: hypernetwork, layered hypernetwork, cross-modal generation,
vision-language, text-to-image, image-to-text, multimodal information retrieval.


1  Introduction 

Recently, cross-modal learning methods have been considered as a major approach
for multimodal information retrieval such as video, image, and article retrieval as well
as automatic tagging and annotation [1-3]. Because there are differences of granularity
in text and image features, however, simple approaches based on text-image
relations are limited in what they can learn. As a model to learn higher-order cross-modal
associations, we used hypernetwork models in the previous study [4]. A hypernetwork
is a higher-order probabilistic graphical model which has properties including glocality,
compositionality, self-assembly, and recall-memory [5]. In the previous study, we


B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 76-87, 2010.
© Springer-Verlag Berlin Heidelberg 2010


LHN Models for Cross-Modal Associative Text and Image Keyword Generation


showed that images could be retrieved with multimodal queries by text-to-image
inference with trained hypernetworks [4].

In this study, we propose a novel modified hypernetwork model, layered
hypernetworks (LHNs), which conducts cross-modal associative learning and inference
including image-to-text as well as text-to-image for multimodal information retrieval.
An LHN is a hypernetwork model with a hierarchical structure of two layers of
hypernetworks. While the first layer is composed of modality-dependent hypernetworks,
only one hypernetwork exists in the second layer, which represents relationships
between the text modality and the image modality. The hierarchical structure makes
LHNs more efficient to analyze than conventional hypernetworks. Trained
LHNs can generate both text and image keywords by cross-modal associative
inference with multimodal queries. In addition, generated visual and textual keywords are
used to retrieve articles by comparing them with text terms in documents and visual
words in images of articles. We use 983 Korean magazine articles with 8,763 images
on women's fashion and life-style as multimodal data. In this study, our contributions
are summarized as follows.

1. We propose a novel modified hypernetwork, named the layered hypernetwork,
for cross-modal associative learning and inference.

2. We propose a method to generate visual and textual keywords based on
text-to-image and image-to-text cross-modal association.

3. We apply the proposed model to magazine article retrieval.

The rest of this paper is organized as follows. In Section 2, we summarize related
work. We explain layered hypernetworks for cross-modal association in
Section 3 and propose a method for cross-modal keyword generation in Section 4.
Section 5 presents the experimental results. Finally, we present concluding remarks in
Section 6.

2  Related  Works 

As multimedia data increase explosively, multimedia data retrieval, covering video,
images and articles, has become an important problem in information retrieval. As an
approach, cross-modal associative learning has been applied to multimodal data retrieval,
although cross-modal learning originates from cognitive science and neuroscience [6]. Snoek
et al. proposed a concept-based video retrieval method [7] and Yan et al. studied a
multimodal retrieval approach including text and image for broadcast news video [8].
D. Li et al. [9] suggested a cross-modal association based factor analysis method as
an alternative to Latent Semantic Indexing (LSI) and Canonical Correlation Analysis
(CCA). Ferecatu et al. showed that the joint use of visual features and concept-based
features with a relevance feedback scheme improves the quality of cross-modal
image retrieval [10]. Goh et al. proposed an image retrieval method based on multimodal
concept-dependent active learning [2]. Also, auto-annotation of unlabeled
images and objects in images is carried out by using a hierarchical latent Dirichlet
allocation model [11]. In addition, human-computer interaction (HCI) is a research area where
cross-modal learning is considered an essential element. In HCI, various modalities
are studied including speech and gestures. Quek et al. studied multimodal human
discourse in terms of gesture and speech [12]. Christoudias et al. proposed a
co-training method for multimodal data to construct a multimodal interface [13]. However,
conventional studies on cross-modal learning are usually based on lower-order
co-occurrence between modalities rather than higher-order relations. Therefore, we propose a
cross-modal learning method based on higher-order inter-modal relationships in this
paper.

3  Cross-Modal  Associative  Learning  Models 

3.1  Hypernetwork  Model 

A hypernetwork is a bio-inspired probabilistic graphical model based on hypergraph
models. The properties of the hypernetwork model are summarized in three aspects:
glocality, compositionality and self association based on randomness and recall [5].

1. Glocality: A hypernetwork consists of hyperedges with various orders.
Lower-order hyperedges can represent general information and higher-order
ones include more specific and local information.

2. Compositionality: A hypernetwork represents a huge structured combinatorial
space. By learning based on an evolutionary strategy, a hypernetwork explores
the combinatorial problem space.

3. Self association: The structure of hypernetworks is self-organized by evolutionary
computation based on random selection. Self association makes the
hypernetwork act like a recall memory.

Formally, a hypernetwork H is defined as H = (V, E, W) where V, E, and W are sets
of vertices, hyperedges, and weights, respectively. In hypernetworks, a vertex means a
value of attributes and a hyperedge represents the combination of more than two vertices
with its own weight. The number of vertices in a hyperedge is called the cardinality or
order of the hyperedge, and a k-hyperedge denotes a hyperedge with k vertices. When the
orders of all hyperedges are k, we call it a k-hypernetwork. Therefore hypernetworks can
represent higher-order relationships among large numbers of attributes.

Since a hypernetwork can be regarded as a probabilistic associative memory model
that stores segments of a given data set $D = \{x^{(n)}\}_{n=1}^{N}$, i.e. $x = \{x_1, x_2, \ldots, x_m\}$,
a learned hypernetwork can retrieve a data sample later. When $I(x^{(n)}, E_i)$ denotes a
function which yields the combination or concatenation of elements of $E_i$, as in (2),
the energy of the hypernetwork is defined as follows:

$$\varepsilon(x^{(n)}; W) = -\sum_{i=1}^{|E|} w_i^{(k)} I(x^{(n)}, E_i), \tag{2}$$

where $w_i^{(k)}$ is the weight of the i-th hyperedge $E_i$ of order k, $x^{(n)}$ means the n-th stored
pattern of data and $E_i$ is $\{x_{i1}, x_{i2}, \ldots, x_{ik}\}$. Then, the probability of the data generated
by a hypernetwork, $P(D \mid W)$, is given as a Gibbs distribution:

$$P(D \mid W) = \prod_{n=1}^{N} P(x^{(n)} \mid W), \tag{3}$$

$$P(x^{(n)} \mid W) = \frac{1}{Z(W)} \exp\left(-\varepsilon(x^{(n)}; W)\right), \tag{4}$$

where $Z(W)$ is a partition function. In addition, the partition function $Z(W)$ is
formulated as follows:

$$Z(W) = \sum_{x^{(m)}} \exp\left(-\varepsilon(x^{(m)}; W)\right). \tag{5}$$
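
To make the energy and the Gibbs distribution concrete, here is a small Python
sketch. It rests on assumptions that go beyond the text: data patterns are
binary tuples, a hyperedge is a tuple of attribute indices, $I(x, E_i)$ is taken
to be the product of the selected attribute values, and the partition function
is summed over the supplied patterns only. The function names are hypothetical.

```python
import math


def indicator(x, edge):
    """I(x, E_i) for binary x: 1 if every attribute indexed by the edge is 1."""
    return math.prod(x[j] for j in edge)


def energy(x, edges, weights):
    """Energy of pattern x: minus the weighted sum of matching hyperedges."""
    return -sum(w * indicator(x, e) for e, w in zip(edges, weights))


def gibbs_probabilities(patterns, edges, weights):
    """P(x | W) for each pattern, normalized over the supplied patterns."""
    unnormalized = [math.exp(-energy(x, edges, weights)) for x in patterns]
    z = sum(unnormalized)  # partition function restricted to `patterns`
    return [u / z for u in unnormalized]


patterns = [(1, 1, 0), (0, 1, 1), (0, 0, 0)]
edges = [(0, 1), (1, 2)]   # two hyperedges of order k = 2
weights = [1.0, 2.0]
print([energy(x, edges, weights) for x in patterns])
print(gibbs_probabilities(patterns, edges, weights))
```

Patterns matched by heavily weighted hyperedges have lower energy and therefore
higher probability, which is the behavior that the likelihood maximization
discussed next relies on.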


That is, a hypernetwork is represented by a probability distribution over combinations
of variables, with the weights as parameters, when we consider attributes in data as random
variables. Considering that learning of hypernetworks is selecting hyperedges with
high weight values, the learning can be considered as the process of maximizing the
log-likelihood. Learning from data is regarded as maximizing the probability of the weight
parameters of a hypernetwork for given data. Given data, the probability of a weight set of
hyperedges, P(W|D), is defined as follows:


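$$P(W \mid D) = \frac{P(D \mid W) P(W)}{P(D)}. \tag{6}$$

According to (4) and (6), then, the likelihood is defined as

$$\left[\prod_{n=1}^{N} P(x^{(n)} \mid W)\right] P(W) = \frac{P(W)}{Z(W)^{N}} \exp\left\{-\sum_{n=1}^{N} \varepsilon(x^{(n)}; W)\right\}. \tag{7}$$

Ignoring $P(W)$, maximizing the argument of the exponential function yields the
maximum likelihood. Using the log function,

$$\arg\max_{W} \log \prod_{n=1}^{N} P(x^{(n)} \mid W) = \arg\max_{W} \left\{ \sum_{n=1}^{N} \sum_{i=1}^{|E|} w_i^{(k)} I(x^{(n)}, E_i) - N \log Z(W) \right\}. \tag{8}$$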


More details on the derivative of the log-likelihood are given in [5]. Therefore,
the log-likelihood of a hypernetwork can be maximized by decreasing the difference
of the hyperedges from a given data set.


3.2  Layered  Hypernetworks 

An LHN is a hypernetwork with a hierarchical structure, and the model consists of two
layers. The first layer is a modality layer and the second one is an integrating layer.
When data consisting of more than one modality are given, the attributes of the given data
are partitioned based on modalities. Hypernetworks in the first layer are built by sampling
from attributes of each modality, and the number of hypernetworks in the first
layer is equal to the number of modalities. Unlike the first layer, only one hypernetwork
exists in the second layer. The second-layer hypernetwork is built by
combining hyperedges randomly selected from modality-dependent hypernetworks in

J.-W. Ha et al.


the first layer. Therefore the hypernetwork in the second layer represents the relationship
between several modalities. As with conventional hypernetworks, the second-layer
hypernetwork is formally defined by the energy function when a weight vector
is given as a parameter. Given a data set D consisting of two modalities,
$D = \{x^{(n)} = (m^1, m^2)^{(n)}\}_{n=1}^{N}$, the energy of the second-layer hypernetwork
$\varepsilon(x^{(n)}; W)$ generated from k-hypernetworks in the first layer is defined as follows:

$$\varepsilon(x^{(n)}; W) = \varepsilon\{(m^1, m^2)^{(n)}; W\} = -\sum_{i=1}^{|E|} w_i^{(k)} I\{(m^1, m^2)^{(n)}, E_i\}, \tag{9}$$

where $m^1$ and $m^2$ are vectors of each modality variable which constitute the n-th data
sample $x^{(n)}$. As in (4), the probability of generating the n-th data sample with two
modalities, $P(x^{(n)} \mid W)$, is defined as follows:

$$P(x^{(n)} \mid W) = \frac{1}{Z(W)} \exp\left[-\varepsilon\{(m^1, m^2)^{(n)}; W\}\right]. \tag{10}$$

Assuming that $m^1$, $m^2$ are the text and image modality respectively, similarly to the
conventional hypernetwork, the probability of data generated by layered hypernetworks,
$P(D \mid W)$, is defined as follows:

$$P(D \mid W) = P(T, I \mid W) = P(T \mid I, W) P(I \mid W) = P(I \mid T, W) P(T \mid W). \tag{11}$$

Formula (11) means that cross-modal inferences between text and image are carried
out by learning parameters of hypernetworks. Figure 1 shows the architecture of
LHNs.

3.3  Cross-Modal  Associative  Learning  of  Layered  Hypernetworks 
3.3.1 Learning of the First-Layer Hypernetworks

Learning of the first-layer hypernetworks is similar to the learning of conventional
hypernetworks [4-5], except that one hypernetwork is built per modality. First, multimodal
data are separated by modality. In this study, each article, which has a unique id,
is divided into vectors of TF-IDF values from its document and vectors of histogram
values from its images. The unique id is used to combine hyperedges of each
modality in learning the second-layer hypernetwork. A hypernetwork is built
by generating hyperedges from each modality; hyperedges are generated
by randomly selecting and combining attributes with nonzero values
for each modality. Attributes with nonzero values are selected because
hyperedges whose vertices all have zero values would otherwise be generated with high
probability, since most attributes are zero due to the sparsity of the data. As explained
in Section 3, learning a hypernetwork amounts to sampling hyperedges that differ
little from the data set. Details of building and learning a hypernetwork are explained
in [5]. As learning continues, the structure of a hypernetwork fits the distribution of
the given data more closely. The constitution of hyperedges, i.e. the structure of a hypernetwork, is
determined by their weights, which reflect their fitness to the training data set. In this
study, we define the weight of a hyperedge, $w$, as follows:

$$w = \frac{C}{\#\text{ of matched training samples} + k}, \qquad (12)$$

where $k$ denotes the order of a hyperedge and $C$ is an arbitrary constant. According to
(12), hyperedges with unique information get higher weights by definition. Also,
hyperedges with low weight values are eliminated, and as many hyperedges as were
erased are regenerated from the training set.
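The weighting and replacement step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: representing a hyperedge as a dict from attribute index to value, the exact-match rule in `count_matches`, and all function names are assumptions made for the example.

```python
def count_matches(edge, samples):
    """Count training samples whose values agree with every vertex of the edge.

    `edge` maps attribute index -> value (hypothetical representation);
    each sample is a sequence of attribute values.
    """
    return sum(all(s[i] == v for i, v in edge.items()) for s in samples)


def edge_weight(edge, samples, C=1.0):
    """Weight as in Eq. (12): C / (# matched training samples + k), k = order."""
    return C / (count_matches(edge, samples) + len(edge))


def prune(edges, samples, replacing_rate, C=1.0):
    """Drop the lowest-weight fraction of edges.

    Returns the kept edges and the number removed, i.e. how many new
    hyperedges would have to be regenerated from the training set.
    """
    ranked = sorted(edges, key=lambda e: edge_weight(e, samples, C), reverse=True)
    n_remove = int(len(ranked) * replacing_rate)
    return ranked[:len(ranked) - n_remove], n_remove
```

With this weight, an edge that matches few training samples (more "unique" information) scores higher than one of the same order that matches many.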


[Figure 1 labels: the 2nd layer hypernetwork; TF/IDF vector; histogram vector of visual words]

Fig. 1. Architecture of layered hypernetwork models
3.3.2 Learning of the Second-Layer Hypernetwork

Learning of the second-layer hypernetwork generates hyperedges which represent
high-order relationships between modalities from the first-layer hypernetworks.
Hyperedges of the second-layer hypernetwork are generated by combining hyperedges of
the hypernetworks in the first layer. In combining, hyperedges from different modalities
with the same id are merged into a new hyperedge. The weight of each generated
hyperedge is obtained by comparing it with the training set, and hyperedges with low
weights are eliminated from the hypernetwork, as in the learning of the first layer.
Then, the generated hypernetwork is evaluated with the training data set. Figure 2 shows
the process of building and learning a layered hypernetwork. In addition, the algorithm for
building and learning the second-layer hypernetwork is presented in detail in Figure 3.
In our method, the learning process finishes after a fixed number of epochs.


H_T: hypernetwork from text data, H_I: hypernetwork from image data,
H_L: layered hypernetwork, R: replacing rate of hyperedges with low weights,
CR: combining rate of hyperedges of H_I with a hyperedge of H_T

H_T ← makeHypernetwork(T); H_I ← makeHypernetwork(I)
For i ← 1 until end condition
    H_T ← learningHypernetwork(T); H_I ← learningHypernetwork(I)
    H_T ← removeLowedges(R); H_I ← removeLowedges(R); H_L ← {}
    For j ← 1 to |H_T|
        E_T ← the j-th hyperedge of H_T
        For k ← 1 to CR
            E_I ← a randomly selected hyperedge from H_I with the same id as E_T
            E_L ← E_T ∪ E_I; H_L ← H_L ∪ E_L
        End For
    End For
    H_L ← removeLowedges(R); H_L ← learningHypernetwork(T, I)
    evaluate(H_L, I, T)
    H_T ← Resampling(T, R); H_I ← Resampling(I, R)
End For

Fig. 3. Algorithm of building and learning a layered hypernetwork. Details of functions for
learning are explained in our previous studies [4-5].
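The inner loops of Fig. 3, which merge text and image hyperedges that share an article id, can be sketched as below. The dict layout of a hyperedge (`"id"`, `"vertices"`) and the function name `combine_by_id` are hypothetical choices for this illustration; the learning, pruning and resampling functions of Fig. 3 are described in [4-5] and are not reproduced here.

```python
import random


def combine_by_id(text_edges, image_edges, combining_rate, rng=None):
    """Build second-layer hyperedges by pairing edges with the same article id.

    Each first-layer edge is assumed to be a dict with an "id" and a
    "vertices" mapping (attribute index -> value). For every text edge,
    `combining_rate` image edges with the same id are drawn at random.
    """
    rng = rng or random.Random(0)
    by_id = {}
    for e in image_edges:
        by_id.setdefault(e["id"], []).append(e)

    layered = []
    for et in text_edges:
        candidates = by_id.get(et["id"], [])
        if not candidates:
            continue  # no image hyperedge shares this id
        for _ in range(combining_rate):
            ei = rng.choice(candidates)
            layered.append({
                "id": et["id"],
                "text": dict(et["vertices"]),
                "image": dict(ei["vertices"]),
            })
    return layered
```

Keeping the text and image parts in separate fields is a design choice of this sketch: it avoids index collisions between the two attribute spaces when the union E_T ∪ E_I is formed.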


4  Cross-Modal  Inference  for  Image  and  Text  Keyword 
Generation 

Trained LHNs can generate both text terms and visual words from given multimodal
queries by cross-modal associative inference. Cross-modal associative generation is
of two types: text-to-image generation, which generates a set of visual words for
given text terms, and image-to-text generation, which reconstructs a set of text terms from
visual words. In image-to-text generation, the generated set of text terms is composed of the text
terms in hyperedges of the second-layer hypernetwork whose vertices include at least
one visual word in the given set of visual words. To select text terms, we define a
score based on the co-occurrence of text terms and visual words. When a visual word set
$Q$ is given, the score of the $i$-th text term in the $n$-th hyperedge $E_n$ of the second-layer
hypernetwork is defined as follows:

$$s_{Idx(i), E_n} = \begin{cases} \dfrac{x_{Idx(i)} \times w_n}{|Q - E_n| \times C + 1} & (Q \cap E_n \neq \emptyset) \\ 0 & (Q \cap E_n = \emptyset) \end{cases} \qquad (13)$$

where $x_{Idx(i)}$ is the value of the text term attribute whose index is $Idx(i)$, $Idx(i)$ denotes the
index in the vector representation of the $i$-th text term of a hyperedge $E_n$, $w_n$ is the
weight of $E_n$, $|Q - E_n|$ is the size of the relative complement, and $C$ is an arbitrary
constant for the penalty. Then $s_{Idx(i)}$ is obtained by summing over all hyperedges as
follows:

$$s_{Idx(i)} = \sum_{n=1}^{|E|} s_{Idx(i), E_n} \qquad (14)$$

where $|E|$ denotes the number of hyperedges in the second-layer hypernetwork. According
to (13), the more visual words of the given visual word set a hyperedge includes,
the larger the score of the text terms in that hyperedge. Text terms with higher scores
are then included as candidates for the generated text keywords.

In the same way as image-to-text generation, a set of visual words is generated with a trained layered
hypernetwork and given text terms.
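The scoring in (13) and (14) can be illustrated with a short Python sketch. The edge layout (a dict with `"text"`, `"image"` and `"weight"` fields) and the function name `text_scores` are assumptions for the example, not part of the original method description.

```python
from collections import defaultdict


def text_scores(edges, query, C=1.0):
    """Accumulate Eq. (13) over all second-layer hyperedges, as in Eq. (14).

    edges: list of dicts with "text" (term index -> value),
           "image" (visual word index -> value) and "weight".
    query: set of visual word indices (the set Q).
    """
    scores = defaultdict(float)
    for e in edges:
        visual = set(e["image"])
        if not (query & visual):
            continue  # Q and E_n do not overlap: score 0
        # |Q - E_n| * C + 1 penalizes query words missing from the edge
        penalty = len(query - visual) * C + 1
        for idx, value in e["text"].items():
            scores[idx] += value * e["weight"] / penalty
    return dict(scores)
```

Sorting the returned dict by value and keeping the top entries would give the candidate text keywords; the same routine with the roles of text and image swapped covers the text-to-image direction.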

5  Experimental  Results 

5.1  Data  and  Experimental  Setups 

We use 983 articles with 8,673 images from three Korean magazines on female fashion,
named 'luxury', 'beauty life' and 'haute', as training data provided by a
company named ddh co. As preprocessing for modeling, the documents in the articles are
converted to vectors of TF-IDF values of 5,000 text terms which are selected by


Table 1. The parameters used for the experiment

Parameters                     Value
Order (text, image)            (20, 20)
Replacing rate                 0.1
Sampling rate (text, image)    (20, 10)
Combining rate                 10
Num. of iterations             5

Combining rate means the number of hyperedges of one modality hypernetwork combined with a hyperedge
of the other modality hypernetwork in learning of the second layer. Sampling rate denotes the number of
hyperedges sampled from a training data sample. Replacing rate is the ratio of hyperedges with low
weight eliminated in one iteration.
occurrence frequency in documents after stemming. Also, an image is represented
with a vector of histograms of 4,022 visual words extracted by SURF [14]. Then, the
values of each modality are converted to three-level values from 0 to 2, since hypernetwork
models deal with discretized data. The data are divided into a training set
with 884 documents and 7,555 images and a test set consisting of 99 documents and
845 images for article retrieval. Table 1 shows the parameter settings used to train the layered
hypernetworks.


5.2  Experimental  Results 


We evaluate the similarity of cross-modal associative generation by comparing the generated
text terms and visual words with the text and image keywords in the given query. To
evaluate the similarity, we define two measures in this paper. The first measure is the
ratio of correctness (RC). We call the set of text terms and visual
words which constitute a document and an image in an article the original set, and we
generate as many text terms or visual words as the size of the original set.
Then we compare the generated textual or visual set with the original set when partial
text terms and visual words are given. RC is defined as follows:

$$RC = \frac{\#\text{ of generated keywords that are the same as keywords in the original set}}{\#\text{ of generated text (image) keywords}}. \qquad (15)$$
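As a small illustration of (15), RC can be computed as below. The function name and the use of plain Python collections are assumptions for this sketch; returning 0.0 for an empty generated set is also an assumption, since the source does not cover that case.

```python
def ratio_of_correctness(generated, original):
    """Eq. (15): share of generated keywords that occur in the original set."""
    generated = list(generated)
    if not generated:
        return 0.0  # assumption: an empty generation counts as RC = 0
    original = set(original)
    hits = sum(1 for g in generated if g in original)
    return hits / len(generated)
```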

According to (15), RC takes a value from 0 to 1. The second measure is the context
score (CS), which is based on the pair-wise co-occurrence of all text terms and visual
words with nonzero values in the documents and images of the article data. To obtain
CS, we define a measure of pair-wise co-occurrence for the $i$-th and $j$-th keywords as
follows:


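$$m_{ij} = \begin{cases} \displaystyle\sum_{n=1}^{N} \frac{x_i^{(n)} \times x_j^{(n)}}{C \times \{(x_i^{(n)})^2 - (x_j^{(n)})^2\}^2 + 1} & (i \neq j) \\ \text{[illegible]} & (i = j) \end{cases} \qquad (16)$$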


where $x_i$ and $x_j$ are the values with indices $i$ and $j$ in the $n$-th data sample $\mathbf{x}^{(n)}$, $N$ is
the size of the data set, and $C$ is an arbitrary constant. Then, CS is defined as follows:


$$CS = \frac{1}{|G|^2} \sum_{i,j} m_{ij} \qquad (17)$$

where $|G|$ is the size of the set of generated text terms or visual words. CS differs
from RC in that CS reflects the context of the relationships between generated
keywords. Even when the RCs of two generated sets are the same, their CSs may differ
depending on the co-occurrence frequency of the wrongly generated keywords.
Figures 4 and 5 show the results of image-to-text generation of text terms and
text-to-image generation of visual words for all training data when a few text terms and visual words
are given as a query. Figure 4 shows the average RC and CS of text terms generated
by image-to-text generation for 889 documents. Cross-modal queries improve the
accuracy of generating text terms related to the given queries by more than 40 percentage points
Fig. 4. Average RC (a) and CS (b) of generated text terms by image-to-text generation

Fig. 5. Average RC (a) and CS (b) of generated keywords by text-to-image generation. The scale of
the context score of text-to-image generation is much larger than that of image-to-text generation,
since the image data are approximately ten times larger and the histogram
vectors of images have many more nonzero variables than the TF-IDF vectors of documents

Fig. 6. Articles whose text terms are generated perfectly given one text term and
20% of the visual words in the article

Fig. 7. Ratio of successful retrievals for the test data set as the number of given text terms
increases
compared with a text-only query. According to Figure 4, when the same number of text
terms is given, the similarity score of the generated text terms gets higher as the information from the
given image increases. Also, without any text keyword query, text terms in the original
set can be generated from partial images only. Figure 5 presents the average RC and CS of
visual words generated by text-to-image generation for 884 images among the training
images. As in Figure 4, multimodal information increases both scores compared with
image input only. Unlike in image-to-text generation, RCs saturate when more
than two text terms are given. In addition, CSs show different patterns from image-to-text
generation. The reason is that an article consists of one document and several
images, so that image information is more important than text information. Figure 6
shows four pairs of a set of text terms and an image from articles whose RCs are 1
when one text term and 20% of the visual words in the article are given as a query. We
can generate text terms and retrieve the article from a small part of its information by cross-modal
associative generation. Figure 7 presents the ratio of successful article retrievals
when partial text terms of a data sample are given for the test data set using the trained layered
hypernetwork. In this study, article retrieval is considered successful when the candidates
include the test article whose text terms and visual words are given as a query.
According to Figure 7, given more than two text terms and half of the image information, the article which
a user wants is included among the candidates in over 90% of cases when the number of candidates is 20.

6  Concluding  Remarks 

In this paper, we propose LHNs for cross-modal associative learning and a method to
generate visual and textual keywords based on text-to-image and image-to-text cross-modal
inference with LHNs for given multimodal queries. Experimental results show
that it is possible to generate keywords based on cross-modal association between
modalities. Also, multimodal queries improve the similarity of generated keywords
compared with unimodal ones. In addition, we show that the proposed model and
method can be applied to an article retrieval system. As future work, we will apply
the cross-modal associative keyword generation method to various problems such as
auto-annotation of unlabeled images as well as multimodal information retrieval.
Abstract. Humans can associate vision and language modalities and thus generate
mental imagery, i.e. visual images, from linguistic input in an environment
of unlimited inflowing information. Inspired by human memory, we separate a
text-to-image retrieval task into two steps: 1) text-to-image conversion (generating
visual queries for the second step) and 2) image-to-image retrieval. This
separation is advantageous for visualizing the inner representation, learning
incremental datasets, and using the results of content-based image retrieval. Here, we
propose a visual query expansion method that simulates the capability of human
associative memory. We use a hypernetwork model (HN) that combines visual
words and linguistic words. HNs learn the higher-order cross-modal associative
relationships incrementally on a set of image-text pairs in sequence. An incremental
HN generates images by assembling visual words based on linguistic
cues, and we retrieve similar images with the generated visual query. The
method is evaluated on 26 video clips of 'Thomas and Friends'. Experiments
show a successful image retrieval rate of up to 98.1% with a
single text cue. The method also shows the potential to generate the visual query
with several text cues simultaneously.

Keywords: hypernetwork, incremental data, visual query expansion, vision-language,
text-to-image, multimodal information processing.


1  Introduction 

Conventional text-to-image retrieval methods for image-text corpora have used the
tags annotated on images to search for the target [1]. Recently,
multi-modal data such as video, sound, and images, as well as web pages including images,
have been increasing explosively. Consequently, the underlying data distribution may change
over time [3]. We therefore need incremental models to learn multi-modal data.

Humans can associate vision and language modalities and thus generate mental
imagery, i.e. visual images, from linguistic input in an environment of unlimited
inflowing information. Considering the human capability of multimodal memory
[2,5,16], we separate a text-to-image retrieval task into two steps. In the first step,
text-to-image conversion is used to generate the visual concept from the related
images associated with text cues. The second step is to search for similar images
with the expanded visual query from the first step. This approach has several advantages.
First, we can visualize the inner representation in the form of visual images.
Second, we can deal with incremental data by updating visual queries incrementally
in the first step. Third, we can use results from content-based image retrieval
(CBIR) for the second step. In addition, after generating visual queries with sufficiently
large data, we expect the visual queries to serve as universal visual concepts when
retrieving from all image databases.

Here, we propose a novel visual query expansion method that simulates the capability
of human associative memory. Hypernetwork models (HN) have the cognitive
properties of continuity, glocality, and compositionality [5]. HNs learn higher-order
cross-modal associations to resolve the difference in granularity between image and text
features. HNs can be appended and updated partially by adding new hyperedges from
new observations as incremental learning. In particular, we built beforehand a visual word
dictionary that keeps the regional information from an image. This enables us to
visualize the visual query and to avoid the computational complexity of
the image representation. As Fig. 1 shows, 1007 image-text pairs were captured from
26 video clips of Thomas and Friends. For the second step, we simply used the sum of absolute
differences in RGB scale between images.

This paper is organized as follows. Section 2 summarizes related works. Then
hypernetworks are introduced briefly in Section 3 and the proposed method is
explained in Section 4. Section 5 shows the experimental results. Finally, Section 6
concludes this paper with concluding remarks.

2  Related  Work 

Cross-modal data retrieval has become a focus of the information retrieval field as a
result of readily available multimedia data. Approaches using multimodal data have
been introduced using tagging-based methods, such as automatic tagging and annotation,
and statistical dependency-based methods, such as co-occurrence and canonical
correlation analysis (CCA) [1-2]. And approaches using image annotation


Fig. 1. The training dataset used in this paper. The pairs from one clip are one unit of instances
for sequential presentation in incremental learning.
information were studied. Jeon et al. proposed a cross-media relevance model
(CMRM) [6] using annotated images and manually grouping small blobs of images.
Pan et al. studied graph-based methods for discovering correlated nodes across
other modalities [7]. Cross-modal association learning has also been applied to video
data. Yan et al. studied a text-image multimodal retrieval task on broadcast
news video data [9] and Snoek et al. suggested a concept-based video retrieval method [8].
Additionally, D. Li et al. proposed a factor analysis method based on cross-modal
association [10].

Visual query expansion is mainly used to improve the performance of the
retrieval task. Chum et al. introduced query expansion using images by analogy with
text retrieval. They used images as added queries giving spatial constraints and
improved the retrieval performance for false negatives [12]. Joly et al. applied this
concept to logo retrieval in large image collections [13] and Jiang et al. applied it to
bag-of-visual-words [14]. As for visual representation, visual mental imagery is
used as an inner representation in the cognitive processes of humans [16], AIs [17] and even
robots [18].

In [4], Ha et al. studied the image-text cross-modal retrieval task with multimodal
queries based on gray-scale pixels on a fixed dataset. In contrast, we deal
with the relevant image retrieval task based on incremental HNs with color image
patches on an increasing dataset.

3  Multi-modal  Hypernetwork  Models 

3.1  Hypernetwork  Models 

A hypernetwork (HN) is a hypergraph which is represented with vertices and
weighted hyperedges. Hypergraphs generalize simple graphs by allowing
edges of higher cardinality. The edges in a hypergraph are called hyperedges. Fig. 2
shows an example of an HN. Formally, an HN is defined as H = (V, E, W),
where V, E and W are sets of vertices, hyperedges, and weights respectively, and the
elements of W correspond to the elements of E. An HN is formulated on the basis of
probability theory. Given a data set $D = \{\mathbf{x}^{(n)}\}_{n=1}^{N}$ of N samples, the HN can be


N 


(1) 


where $Z(W)$ denotes the partition function as the normalization term and $\mathbf{x}^{(n)}$ means the
$n$-th instance of data. The energy function $\varepsilon$ of the HN and the partition function
are defined as


(3) 
H = (V, E, W)
V = {v1, v2, v3, ..., v7}
E = {E1, E2, E3, E4, E5}
E1 = {v1, v3, v4}
E2 = {v1, v4}
E3 = {v2, v3, v6}
E4 = {v3, v4, v6, v7}
E5 = {v4, v5, v7}

Fig. 2. An example of a hypernetwork. Hypernetwork H is composed of vertex set V, hyperedge
set E and the corresponding weight set W.


$$Z(W) = \sum_{\mathbf{x}^{(n)}} \exp\left( \sum_{m=1}^{|E|} w_m \delta(\mathbf{x}^{(n)}, E_m) \right) \qquad (4)$$


where $w_m$ is a positive real-valued weight of the $m$-th hyperedge $E_m$ and $\delta(\mathbf{x}^{(n)}, E_m)$ denotes
the identity function depending on the input elements of $\mathbf{x}$ and the hyperedge $E_m$.
Taking the derivative of the log-likelihood function of (2), we can derive the following:

$$\ln P(D \mid W) = \ln \prod_{n=1}^{N} P(\mathbf{x}^{(n)} \mid W) \qquad (5)$$


And  minimizing  the  difference  between  two  average  frequencies  is  equivalent  to 
maximizing  the  likelihood  by  making  (6)  be  equal  to  zero  [5]. 

Then,  the  term 


I  El  n 

}/im  (7) 

m~\ 

can  also  be  derived  and  it  means  that  the  total  number  of  matching  hyperedges  with 
the  given  data  set  D  follows  the  average  frequencies  of  the  hyperedges  in  the  data  set. 

3.2 Cross-Modal Associative Learning on Incremental Hypernetwork Models

To learn cross-modal associative information, we create cross-modal hyperedges
composed exclusively of a textual part and a visual part, which are sampled from the text
and the image respectively, as shown in Fig. 3. Formally, given an instance $\mathbf{x} = \{\mathbf{x}_I, \mathbf{x}_T\}$, $\mathbf{x}_I$
is the feature set for image representation and $\mathbf{x}_T$ is that for text representation.
$$\mathbf{x}_I = \{x_1^I, x_2^I, \ldots, x_P^I\} \qquad (8)$$

$$\mathbf{x}_T = \{x_1^T, x_2^T, \ldots, x_Q^T\} \qquad (9)$$

where $P$ and $Q$ are the numbers of features for images and text respectively, i.e.
the sizes of the visual word dictionary and the linguistic word dictionary. $x_k^I$ and $x_j^T$ are
features denoting the $k$-th element of the visual word dictionary and the $j$-th one of the
text word dictionary respectively. Then the joint distribution given arbitrary weights
from (1) can be converted using the composition of hyperedges, and written in the
form taken from (7) by changing the weight to reflect the number of matched
instances among the $N$ instances of the dataset.

$$P(D \mid W) = P(D_I, D_T \mid W) \propto \sum_{n=1}^{N} \sum_{m=1}^{|E|} w_m \delta(\mathbf{x}^{(n)}, E_m) = \sum_{m=1}^{|E|} w_m \delta(\mathbf{x}, E_m) \qquad (10)$$

where $D_I$ is the dataset of image features and $D_T$ is that of text features. The
distribution is thus represented by weighted nonzero basis functions having a zero-one binary
value. However, using all of the possible hyperedges from order 1 up to the total number
of features is practically impossible because of combinatorial explosion: the
number of cases increases massively. We therefore approximate the distribution
with a relatively small number of hyperedges by using a random sampling strategy. We
can approximate the joint distribution using $M$ hyperedges as in the following formula:

$$P(D_I, D_T \mid W) \propto \sum_{m=1}^{|E|} w_m \delta(\mathbf{x}, E_m) \approx \sum_{m=1}^{M} w_m \delta(\mathbf{x}, E_m) \qquad (11)$$

If $M$ is large enough to express the distribution, the error between the estimate
and the distribution decreases. Hence, we can estimate the distribution
roughly by simply using a reasonably small number of hyperedges.

Fig. 3. An example of cross-modal hyperedges using the visual word dictionary. For the
experiments, a tri-gram is used for sampling from the text part, and image patches are randomly sampled
among 92 regions on the grid for the image part.

Fig. 4. The flow chart for cross-modal associative learning. The top shows the case for the
fixed dataset and the bottom shows that for the incremental dataset.


For incremental HN learning, we can easily apply the same strategy with a small
adjustment. Formally, we define the preliminary dataset as $D_0$ and the $n$-th new dataset
as $D_{n+1}$. Then, the $n$-th accumulated training set $D^{(n+1)}$ can be written as follows:

$$D^{(n+1)} = D^{(n)} \cup D_{n+1} \qquad (12)$$

Whenever a new dataset flows in, adding new hyperedges from it by the random
sampling strategy keeps the error between the estimate and the
distribution small, provided that the number of hyperedges remains sufficient to
follow the distribution. The process is summarized in Fig. 4.
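The incremental update can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: an instance is taken to be a pair (list of words, dict from grid region to visual word index), a hyperedge is a tuple of a tri-gram and sampled (region, visual word) pairs, and the weight of a hyperedge is simply the number of times it has been sampled. The function names are hypothetical.

```python
import random


def sample_hyperedges(instance, text_order=3, image_order=5,
                      sampling_rate=10, rng=None):
    """Randomly sample cross-modal hyperedges from one image-text pair."""
    rng = rng or random.Random(0)
    words, patches = instance
    edges = []
    if len(words) < text_order or len(patches) < image_order:
        return edges  # instance too small for the requested orders
    for _ in range(sampling_rate):
        # text part: a contiguous n-gram (tri-gram by default)
        start = rng.randrange(len(words) - text_order + 1)
        text_part = tuple(words[start:start + text_order])
        # image part: visual words at randomly chosen grid regions
        regions = rng.sample(sorted(patches), image_order)
        image_part = tuple(sorted((r, patches[r]) for r in regions))
        edges.append((text_part, image_part))
    return edges


def add_increment(hypernetwork, new_dataset, **sampling_args):
    """Add hyperedges sampled from a newly arrived dataset D_{n+1}.

    `hypernetwork` maps hyperedge -> weight; existing edges are kept,
    so the model accumulates over D^(n+1) = D^(n) U D_{n+1}.
    """
    for instance in new_dataset:
        for edge in sample_hyperedges(instance, **sampling_args):
            hypernetwork[edge] = hypernetwork.get(edge, 0) + 1
    return hypernetwork
```

Calling `add_increment` once per video clip, in sequence, mirrors the bottom path of Fig. 4: earlier hyperedges are never recomputed, only extended.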


4  A  Visual  Query  Expansion  Method 

4.1  Building  a  Visual  Word  Dictionary  for  Image  Patches 

Visual query expansion needs image processing in order to use visual features. To avoid
the vast computational complexity of the image representation, we built a visual word
dictionary including 10,000 visual words beforehand. This process is illustrated in
Fig. 5. As image preprocessing, each image is first segmented into 15x15 square
image patches on a regular grid, as shown in the second image in Fig. 3. Following the
work of Feng et al. [15], using rectangular regions can provide performance
gains compared with using regions from automatic image segmentation methods. We
were also able to avoid the problems associated with the computational cost.
Second, we assigned all of the segmented patches to k groups by k-means clustering
in the RGB color space using Koen's image processing package [11]. As a result, we
made 10,000 visual words by choosing, for each cluster, the image patch closest to
its centroid. This set of image patches worked as the visual words in this paper.
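A minimal pure-Python sketch of the dictionary construction is given below. It uses a plain Lloyd-style k-means, not the image processing package cited as [11], and it assumes each patch has already been flattened into a tuple of numbers (for example RGB values); the function names and the fixed iteration count are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_dictionary(patches, k, iterations=10, rng=None):
    """Cluster patches with k-means and return one real patch per cluster."""
    rng = rng or random.Random(0)
    centroids = [list(p) for p in rng.sample(patches, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in patches:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # the visual word of a cluster is the actual patch closest to its centroid
    return [min(patches, key=lambda p: sq_dist(p, centroids[c])) for c in range(k)]


def assign_word(patch, dictionary):
    """Index of the visual word closest to the given patch."""
    return min(range(len(dictionary)), key=lambda w: sq_dist(patch, dictionary[w]))
```

Choosing a real patch rather than the centroid itself keeps every visual word displayable as an image, which is what allows the generated visual query to be visualized.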


4.2  Visual  Query  Expansion  by  Combining  Image  Patches 

An expanded visual query can be created by the following process. Given the
linguistic cue, which works as the condition on (10), we can make inferences with
the trained HN by the following formula:

$$P(D_I \mid D_T, W) = \frac{P(D_I, D_T \mid W)}{P(D_T \mid W)} \propto \sum_{m \in E_{T_q}} w_m \delta(\mathbf{x}, E_m) \qquad (13)$$

where $E_{T_q}$ is the set of cross-modal hyperedges including the text $T_q$. Then, we choose
the index $x_j^*$ of the visual word that maximizes the conditional likelihood at the
$j$-th region on the grid as follows:

$$x_j^* = \arg\max_{p} P(I_j = f(x_p^I) \mid D_T, W) = \arg\max_{p} \sum_{m \in E_{T_q}} w_m \delta(\mathbf{x}, E_m) \qquad (14)$$

where $f$ is the mapping function to the visual word. Combining the chosen visual words generates the
visual query. On HNs, this process is achieved by choosing the visual word with the
maximum total weight of hyperedges which are relevant to the text $T_q$, as in the following
summarized procedure.


[Figure 5 panels: each image is segmented into 92 15x15 square image patches; k-means clustering with image patches from all images; the patch closest to the centroid of each cluster is chosen for the visual word dictionary; visual words are assigned on a regular grid]

Fig. 5. The process to build a visual word dictionary and to convert original images into ones to
be trained. All of the segmented image patches are grouped into 10,000 clusters, and each original
image patch is converted to the closest visual word.


Visual Query Expansion via Incremental Hypernetwork Models of Image and Text




1. Sum up the weights of the hyperedges containing the text Tq.

2. Choose the index of the visual word that maximizes the conditional likelihood
at the j-th region on the grid.

3. Combine the image patches with the corresponding index at the j-th region.
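The three steps above can be sketched in Python as follows. The hyperedge layout (a triple of text words, (region, visual word) pairs, and weight) and the function name `generate_visual_query` are assumptions made for this illustration; regions that no relevant hyperedge covers are simply left out of the result.

```python
from collections import defaultdict


def generate_visual_query(hyperedges, cue):
    """Pick, for every grid region, the visual word with the largest summed weight.

    hyperedges: iterable of (text_words, image_part, weight), where
    image_part is a sequence of (region, visual_word_index) pairs.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for text_part, image_part, weight in hyperedges:
        if cue not in text_part:
            continue  # step 1: only hyperedges containing the text cue count
        for region, word in image_part:
            votes[region][word] += weight
    # step 2: argmax over visual words per region; step 3 would paste the
    # corresponding image patches at those regions to form the query image
    return {region: max(words, key=words.get) for region, words in votes.items()}
```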

5  Experimental  Results 

5.1  Data  and  Experimental  Setups 

As mentioned briefly in Section 1 and Fig. 1, we captured 1007 image-text pairs from
26 video clips of Thomas and Friends season 1. We used a capture tool to collect
image-text pairs automatically whenever a subtitle appeared. Table 1 shows the
distribution across the 26 video clips, and the experimental setting is shown in Table 2.

5.2  Experimental  Results 

During the incremental learning, HNs were trained in sequence, and the top-N
closest images were retrieved using the sum of absolute differences in RGB scale between
the generated visual query and the original images, to perform an image-to-image retrieval task.
Fig. 6 and Fig. 7 show the results for image order 5 and order 35 respectively when the cue
'engine' is given. Shown in order are the generated visual query, the top-5 images
closest to the visual query in dataset Dn, and all of the original images
associated with the cue in Dn. The associated original images are the same, but the
generated visual queries are rather different, which causes the top-5 retrieved images to
also be different. They include some original images (10/23 cases of nonzero original
images, 10/45 in total). The visual queries generated from 5-order HNs are more
flexible to incoming new instances than those by 35-order HNs (25 consecutive
differences $\sum |I_t - I_{t-1}|$ per pixel: $\sigma_{35} = 18.7 < \sigma_{5} = 26.6$, $m_{5} = 17.1 \approx m_{35} = 18.7$).
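
The image-to-image retrieval step described above can be illustrated with a minimal sketch. It assumes each image is given as a flat sequence of RGB values of equal length; the function names `sad` and `top_n_by_sad` are hypothetical and not taken from the paper.

```python
def sad(a, b):
    """Sum of absolute differences between two equally long value sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def top_n_by_sad(query, images, n=5):
    """Return the indices of the n images closest to `query` under SAD.

    `sorted` is stable, so images with equal distance keep their dataset order.
    """
    ranked = sorted(range(len(images)), key=lambda i: sad(query, images[i]))
    return ranked[:n]

# Three tiny "images" of two RGB pixels each.
images = [[0] * 6, [10] * 6, [3] * 6]
print(top_n_by_sad([2] * 6, images, n=2))
```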

A visual query can also be generated from two or more words. Fig. 8 compares
the case where the text cues 'noise' and 'once' are given simultaneously with the cases
where each is given alone. Although all are generated from the same HN model, each
single-word query reflects its original images well, and so does the two-word query.
Even though there is no instance


Table  1.  The  frequencies  of  instances  in  the  incremental  dataset 


1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  total

40  40  \S  >9  45  V,  IS  AX  44  46  41  3X  42  40  Vt  AX  311  37  25  IK  41  39  30  46  45  AO  1007 


Table  2.  The  information  and  parameter  set  for  the  experiments 


Information                     Values            Parameters          Values
Total data                      1007 in 26 sets   Text order          3 (tri-gram)
Total text words                1256              Image order         5, 35
Number of regions on 1 image    92                Sampling rate       10
Number of visual words          10,000            Image patch size    15 x 15


Fig. 6. An example in sequential presentation from top left to top right, and bottom left to
bottom right. It shows the generated visual queries, the related original images and the retrieved
top-5 images (image order: 5, linguistic cue: engine).

containing 'noise' and 'once' together (not even in the same dataset), a visual query
mixing the two cases can emerge when the cues 'noise' and 'once' are given together.
This point is important when the amount of data is very large, because each text word can
have its own visual concept, and these concepts can work as additive prototypes.

The overall retrieval performance is summarized in Table 3. It is measured
by checking whether more than one original image is retrieved for each linguistic cue
in the text dictionary during the incremental learning. If a large portion of the corpus is
sparse, unsupervised learning methods have difficulty learning the specific
information needed for discrimination. To show the general characteristics of performance,
we may ignore the low-frequency cases. As a result, we obtain higher accuracy
in retrieving relevant images.

Fig. 7. An example in sequential presentation from top left to top right, and bottom left to
bottom right. It shows the generated visual queries, the related original images and the retrieved
top-5 images (image order: 35, linguistic cue: engine).

Fig. 8. An example of mixed words giving the cues 'noise' and 'once' by sequence 13. Even
though they do not occur in the same instances, the generated visual query reflects the original
images together well. The black visual queries in the left part arise because no instance
for learning 'noise' and 'once' has been presented yet.

Table 3. The overall performance of retrieval results in various tasks (order: 35)

                                                      Size of retrieved candidates (Top-N)
Retrieval task                          # of cases    3        5        8        10
All cases             Successful cases  1256          334      462      617      692
                      Percentage (%)                  26.6%    36.8%    49.1%    55.1%
Cases of freq. >= 3   Successful cases  528           253      338      414      438
                      Percentage (%)                  47.9%    64.0%    78.4%    83.0%
Cases of freq. >= 5   Successful cases  380           215      286      336      343
                      Percentage (%)                  56.6%    75.3%    88.4%    90.3%
Cases of freq. >= 7   Successful cases  288           187      237      270      272
                      Percentage (%)                  64.9%    82.3%    93.8%    94.4%
Cases of freq. >= 10  Successful cases  208           154      190      203      204
                      Percentage (%)                  74.0%    91.4%    97.6%    98.1%

6  Concluding  Remarks 

We separated the text-to-image retrieval task into two steps: 1) text-to-image
conversion and 2) image-to-image retrieval. We proposed a method to generate
a visual query based on cross-modal associative learning by incremental hypernetwork
models, with the focus on the text-image conversion reflecting the related images from
an image-text corpus. Experimental results show that the visual query generated by
this method can be used for the image-to-image retrieval task. In this study, we only
estimated with a small number of bases of a specific order (k-order hyperedges)
without an explicit learning process. We will go on to establish proper learning processes
with unsupervised HNs and apply proper CBIR methods to the second step.

Abstract. Probabilistic models are widely used in evolutionary and related
algorithms. In Genetic Programming (GP), the Probabilistic Prototype Tree (PPT) is
often used as a model representation. Drift due to sampling bias is a widely
recognised problem, and may be serious, particularly in dependent probability models.
While this has been closely studied in independent probability models, and more
recently in probabilistic dependency models, it has received little attention in
systems with strict dependence between probabilistic variables such as arise in PPT
representation. Here, we investigate this issue, and present results suggesting that
the drift effect in such models may be particularly severe - so severe as to cast
doubt on their scalability. We present a preliminary analysis through a factor
representation of the joint probability distribution. We suggest future directions for
research aiming to overcome this problem.


1  Introduction 

A wide range of evolutionary algorithms learn explicit probability models, sampling
individuals from them, using the fitness of individuals to update the model. They range
from Colorni and Dorigo's Ant Colony Optimization (ACO) [1] and Baluja's
Population Based Incremental Learning (PBIL) [2] through Muehlenbein and Mahnig's
Factorized Distribution Algorithm (FDA) [3] or Pelikan's Bayesian Optimization Algorithm
(BOA) [4] to Salustowicz and Schmidhuber's Probabilistic Incremental Program
Evolution (PIPE) [5]. Historically, different strands of this research have developed in relative
isolation, and there is no acknowledged single term to describe them. In this paper, we
refer to such algorithms as Estimation of Distribution Algorithms (EDAs),
acknowledging that this may be wider-than-normal usage.

When EDAs are applied to Genetic Programming (GP) [6] problems, the most
obvious question is what statistical model to use to represent the GP solution space, and
how to learn it. This question has drawn most of the attention of researchers in this field,
with consequent neglect of the sampling stage of EDA-GP algorithms.

In GP, many EDAs have used variants of the Probabilistic Prototype Tree (PPT)
as their probability model, beginning with PIPE [5] and extending to Yanai and Iba's
EDP [7], Sastry et al.'s ECGP [8], Hasegawa and Iba's POLE [9], Looks et al.'s BOAP
[10] and Roux and Fonlupt's Ant Programming [11]. The PPT is a convenient model for
representing probability distributions estimated from tree individuals. However,
Hasegawa and Iba already noted that it suffers from some representational problems,
and proposed the Extended Parse Tree (EPT) variant [9]. What has not been studied is
the effect on sampling drift of its implicit dependence model.

Sampling drift is an important problem for all probability models. However, the
strict probability dependence in the PPT greatly amplifies this effect relative to the other
major sources of bias in EDAs (selection pressure and learning bias), so that it becomes a
critical issue in scaling PPT-based EDAs to large-scale problems.

In this paper, we examine this problem both empirically and mathematically. We
designed two simple problems, closely related to the well-known one-max and max
problems, with simple fitness landscapes to reduce the effects of other factors. We
compare the behaviour of a PIPE model with a PBIL-style independent model to illustrate
the amplified effect of sampling bias. We mathematically investigate how the factorised
distribution implicit in the PPT model causes this increased sampling bias.

In section 2, we present a brief overview of EDAs and of PPTs. The experiments
are described in section 3, with their results following in section 4. Section 5 analyses
the factorisation implicit in the PPT. We discuss the implications of these results in
section 6, drawing conclusions and proposing future directions in section 7.


2  Background  Knowledge 

2.1  Estimation  of  Distribution  Algorithms 

EDAs are evolutionary algorithms incorporating stochastic models. They use the key
evolutionary concepts of iterated stochastic operations as shown below:

    generate N individuals randomly
    while not termination condition do
        Evaluate individuals using fitness function
        Select best individuals
        Construct stochastic model from selected individuals
        Sample new population from model distribution
    end while

They differ from a typical evolutionary algorithm only in model construction and
sampling. All EDAs use some class M of probability models, and a corresponding
decomposition of the structure of individuals. Model construction specifies a model from M
for each component. Sampling a new individual traverses the components, sampling a
value from each model, so that the sample component distribution reflects the model's.
In the simplest version, PBIL, the probability model is a vector of independent
probability tables, one for each location of the phenotype.
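
As a rough illustration of this loop, the sketch below runs a PBIL-style independent model on ordinary binary one-max, with truncation selection and a maximum-likelihood model update. It is a minimal sketch under stated assumptions, not the system used in the paper: the function name `run_independent_eda` and all default parameter values are chosen for the example, and the initial random population is obtained by sampling from a uniform model.

```python
import random

def run_independent_eda(n_bits=15, pop_size=100, ratio=0.3, generations=50, seed=0):
    """Minimal EDA with one independent probability per location (binary one-max)."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits                     # model: P(bit i = 1); uniform start
    best = 0
    for _ in range(generations):
        # Sample a new population from the model distribution.
        pop = [[1 if rng.random() < p else 0 for p in probs] for _ in range(pop_size)]
        # Evaluate with the one-max fitness (sum of bits) and select the best.
        pop.sort(key=sum, reverse=True)
        selected = pop[:max(1, int(ratio * pop_size))]
        best = max(best, sum(pop[0]))
        # Construct the model: maximum-likelihood estimate for each location.
        probs = [sum(ind[i] for ind in selected) / len(selected) for i in range(n_bits)]
    return best, probs

best, probs = run_independent_eda()
print(best, [round(p, 2) for p in probs])
```

Because the update has no smoothing, elitism or mutation, a location whose probability reaches 0 or 1 stays fixed, which is one simple way to observe drift in such a model.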

2.2  Probabilistic Prototype Trees and EDAs

PPT-based EDAs use a tree structure to store the probability distribution. Given a
predefined instruction set of maximum arity n, the PPT is an n-ary full tree storing a
probability table over the set of instructions. PPT was first used in PIPE [5], where each
node contained an independent probability table. ECGP [8] extended this by modelling
dependence between PPT nodes as in the Extended Compact Genetic Algorithm [12].
EDP [7] instead conditioned each node on its parent. BOAP [10] learnt Bayesian
networks (BN) of dependences in the PPT, while POLE [9] learnt BNs representing
dependences in an "Extended Parse Tree", a variant of the PPT.

2.3  Benchmark  Problems 

One Max is the near-trivial problem of finding a fixed-length binary string maximising
the sum of all bits [13]. Its fitness landscape is smooth with no local optima. Thus it
is well-suited to the PBIL independent-probability model, using a probability vector
$V = [E_1, \ldots, E_n]$ over the value set $\{0, 1\}$ to represent the locations in the string.

The Max Problem is a generalisation of one-max, where the goal is to find the
largest-valued tree that can be constructed from a given function set $I$ and terminal set $T$, in a
given depth $D$ [14]. Typically $I = \{\times, +\}$ and $T = \{0.5\}$. This appears well-suited
to the "independent" probability model of PIPE, in that each node of the PPT - in this
case, a full binary tree - holds an independent probability table, giving the probability
of selecting each element of $I \cup T$. The simplest case of max, $I = \{+\}$, $T = \{0, 1\}$,
is closely related to one-max, in that once the system has found a full binary shape, the
remaining problem, of filling the leaves with 1, is essentially the one-max problem. We
note that in making this comparison, we are, in effect, mapping the nodes of the PPT
tree to corresponding locations in a PBIL chromosome.

2.4  Grammar  Guided  Genetic  Programming 

To set the context for this study, we compare the performance of GP on the same
problems; we can't use a standard GP system for this, because it is unable to enforce the
constraints of the one-max problem. For fair comparison, we use a Grammar Guided
GP system (GGGP) [15].

3  Experimental  Analysis 

Our  experiments  illuminate  sampling  drift  in  PPT-based  EDAs,  comparing  it  with  a 
well-understood  model  (PBIL).  We  need  to  specify  four  aspects: 

1.  the probability model structures

2.  the  fitness  functions 

3.  the  EDA  algorithm 

4.  experimental  parameters 

To  illustrate,  we  use  the  max  problem,  and  a  slight  variant  of  one-max,  with  the  same 
target  as  max  (but  a  more  one-max-like  fitness  function).  We  compare  with  a  conven¬ 
tional  GGGP  approach  to  show  the  intrinsic  simplicity  of  these  problems.  For  economy 
of  explanation,  we  describe  the  max  problem  first. 

3.1  Model  Structures 

The Genotype Representation is a 15-long string $X = X_1 \ldots X_{15}$. This can be used
in either of two ways: the string can be modelled through an independent, PBIL-style
genotype, or it can be mapped to a binary PPT of depth 3 (which has 15 nodes).

In the PBIL structure each location contains an independent probability table with
three possible values, +, × and 0.5. The table is used to generate sample values at each
generation, then is updated to reflect the sample distribution of the selected individuals.

In the PPT structure each location contains an independent probability table over the
values +, × and 0.5, but each (except the leaves) has two children, with the relationship:

$\text{left child}(X_i) = X_{2i}$

$\text{right child}(X_i) = X_{2i+1}$

"Independence" in the latter case must be taken with a grain of salt. While the
probability tables in the PBIL structure are independent, the PPT structure introduces a
dependence: the descendants of a node holding the (terminal) value 0.5 are not sampled.
This is the primary issue under consideration here.
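
The implicit dependence can be made concrete with a small sampling sketch. It assumes the 15 nodes are stored in heap order with index 0 unused, writes × as the string "x", and leaves unsampled locations as `None`; the name `sample_ppt` and this data layout are illustrative assumptions rather than the authors' code.

```python
import random

FUNCTIONS = {"+", "x"}          # binary functions; "0.5" is the only terminal

def sample_ppt(tables, rng):
    """Sample one individual from per-node tables (tables[i]: value -> probability).

    Children of node i are nodes 2i and 2i+1. A node is sampled only if its
    parent holds a function, so descendants of a terminal stay None.
    """
    n = len(tables) - 1                       # index 0 is unused
    genotype = [None] * (n + 1)
    stack = [1]                               # start at the root
    while stack:
        i = stack.pop()
        values, weights = zip(*tables[i].items())
        genotype[i] = rng.choices(values, weights=weights)[0]
        if genotype[i] in FUNCTIONS and 2 * i + 1 <= n:
            stack.extend((2 * i, 2 * i + 1))
    return genotype

uniform = [None] + [{"+": 1 / 3, "x": 1 / 3, "0.5": 1 / 3} for _ in range(15)]
print(sample_ppt(uniform, random.Random(0)))
```

In a PBIL-style sampler every one of the 15 locations would receive a value in every individual; here the number of `None` entries shows how many locations contributed nothing to the sample.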


3.2  Max  Problem  Fitness  Function 


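Fitness is defined by the following equation:

$$
\mathrm{mFit}(X_i) =
\begin{cases}
\mathrm{mFit}(\text{left child}(X_i)) \times \mathrm{mFit}(\text{right child}(X_i)) & \text{if } X_i = \times,\ 1 \le i \le 7 \\
\mathrm{mFit}(\text{left child}(X_i)) + \mathrm{mFit}(\text{right child}(X_i)) & \text{if } X_i = +,\ 1 \le i \le 7 \\
0.0 & \text{if } X_i = \times,\ 8 \le i \le 15 \\
0.0 & \text{if } X_i = +,\ 8 \le i \le 15 \\
0.5 & \text{if } X_i = 0.5
\end{cases}
$$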


When + or × is used in a leaf node, there is a problem in allocating fitness, since leaves
have no children. To overcome this, in this case we give them fitness 0. The maximum
value of this function (the target) corresponds to a full binary tree with + in the bottom
two layers of internal nodes, and + or × in the top layer.
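
The recursive definition can be checked with a short sketch. It assumes a 1-indexed list of 16 entries (index 0 unused) holding the strings "+", "x" (for ×) and "0.5"; the name `m_fit` is a label chosen for this example.

```python
def m_fit(x, i=1):
    """Max-problem fitness of the subtree rooted at location i (heap order)."""
    if x[i] == "0.5":
        return 0.5                     # terminal
    if i >= 8:
        return 0.0                     # a function in a leaf has no children
    left, right = m_fit(x, 2 * i), m_fit(x, 2 * i + 1)
    return left * right if x[i] == "x" else left + right

target_plus = [None] + ["+"] * 7 + ["0.5"] * 8
target_times = [None, "x"] + ["+"] * 6 + ["0.5"] * 8
print(m_fit(target_plus), m_fit(target_times))   # both targets reach the optimum 4.0
```

Since a terminal returns before its children are visited, locations below a 0.5 never influence the fitness, whatever they contain.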

3.3  Variant  One-Max  Problem  Fitness  Function 


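The task is to find a string having a specific value in each location, defined by dividing
the locations into three groups, as in equations 1.

$$
\begin{aligned}
L_1 &= \{X_1\} \\
L_2 &= \{X_i\},\ 2 \le i \le 7 \\
L_3 &= \{X_i\},\ 8 \le i \le 15
\end{aligned}
\qquad (1)
$$

In this case, the fitness function is given by equation 2:

$$\mathrm{omFit}(X) = \sum_{i=1}^{15} \mathrm{locFit}(X_i) \qquad (2)$$

where

$$
\mathrm{locFit}(X_i) =
\begin{cases}
1 & \text{if } X_i = \times \text{ and } X_i \in L_1 \\
1 & \text{if } X_i = + \text{ and } X_i \in L_2 \\
1 & \text{if } X_i = 0.5 \text{ and } X_i \in L_3 \\
0 & \text{else}
\end{cases}
$$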


This differs from the typical one-max problem in two ways: there are three possible
values, not two, and the target values differ with location. However, neither makes much
difference to the fitness landscape, which remains smooth, with no local optima.
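
A direct transcription of equations 1 and 2 might look as follows; this is an illustrative sketch in which × is written as "x", the genotype is a 1-indexed list with index 0 unused, and the names `target_value` and `om_fit` are chosen for the example.

```python
def target_value(i):
    """Target symbol for location i, following the groups L1, L2 and L3."""
    if i == 1:
        return "x"        # L1: the root
    if 2 <= i <= 7:
        return "+"        # L2: remaining internal locations
    return "0.5"          # L3: leaf locations 8..15

def om_fit(x):
    """Number of locations 1..15 that hold their target value."""
    return sum(1 for i in range(1, 16) if x[i] == target_value(i))

optimum = [None, "x"] + ["+"] * 6 + ["0.5"] * 8
print(om_fit(optimum))    # 15, the optimum
```

Unsampled locations (for example `None` entries produced by a PPT-style sampler) simply score 0 under this function.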


K. Kim, B. (R.I.) McKay, and D. Punithan


3.4  EDA  System 

In  these  comparisons,  we  use  a  very  simple  EDA  system  so  that  the  implications  of  the 
experiments  are  clear.  In  detail: 

Selection:  truncation.  Given  a  selection  ratio  A,  the  top  A  proportion  of  individuals  are 
selected.  We  varied  the  selection  ratio  A  to  investigate  the  effect  and  scale  of  drift. 

Model Update: the model structure was fixed for the whole evolution. Maximum
likelihood was used to estimate the probabilities from the selected sample.

Sampling: we used Probabilistic Logic Sampling [16], the most straightforward
sampling method, used in most EDA-GP systems.

To simplify understanding, two common EDA mechanisms which can slow drift,
elitism and mutation, were omitted from the system.

3.5  Parameter  Settings 

We  used  truncation  selection  with  selection  ratios  ranging  from  10%  to  100%  at  a  10% 
interval.  The  population  size  was  100,  and  the  algorithm  was  run  for  200  generations. 
Each setting was run 30 times. Detailed parameter settings for the GGGP and EDA-GP
runs are shown in Table 1, while the grammar used for GGGP (with starting symbol
EXPi) is shown in Table 2.


Table 1. Experimental Parameter Settings

General                        EDA                              GGGP
Parameters    Value            Parameters    Value              Parameters     Value
Genotype                       Operators                        Operators
  Length      15                 Selection   Truncation           Selection    Tournament
  Values      +, ×, 0.5          Ratios      0.1,...,1.0          Size         5
                                 Update      Max. Likelihood      Cross. prob. 0.5
                                 Sampling    PLS                  Mut. prob.   0.75
Population    50               Dependence                       Reproduction   Generational
Generations   200                PBIL        independent
Runs          30                 PPT         PPT

Table 2. GGGP Grammar

EXP_i → EXP_{i+1} OP EXP_{i+1}   (0 ≤ i < 4)
EXP_i → OP
OP → + | × | 0.5

4  Result  of  Preliminary  Experiments 

4.1  One-Max  Results 

Figure 1 shows the performance of the two probability models, at various levels of
selection. Each plot shows a particular structure for a range of different selection ratios.
Each line represents the best fitness achieved in each generation, for a particular
selection ratio. By comparison, GGGP finds perfect solutions in 14.3 ± 4.9 generations.

Fig. 1. Best Fitness vs Generation for One-max Variant (Structure: left, PBIL; right, PPT;
percentage is the selection ratio)

We note that even for this near-trivial fitness function, PPT shows worse performance
than PBIL. In the left-hand plot, the PBIL structure finds a solution close to the optimum
(15) at most selection ratios other than 90% and 100% (i.e. no selection). These results
are replicated for the selection ratios not plotted, most showing performance very close
to the optimum, as with the 40% selection ratio. By comparison, the PPT model shows
much worse performance. In all selection ratios, PPT converges to sub-optimal
solutions. The difference increases with weaker selection, with the 100% ratio showing a
substantial decrease in fitness, below that achieved by random sampling. With selection
pressure turned off, this drift is the result purely of sampling. With increasing
selectivity, the drift effect becomes weaker, but still acts counter to the selection pressure.

4.2  Max  Problem 

This problem is much tougher than the previous. GGGP finds perfect solutions in 17.8 ±
8.0 generations. However, EDA performance fares far worse. The PBIL model is unable
to find the optimum solution (4) at any selection ratio, and the differences from the
optimum are larger than for one-max. Given that the fitness function has epistasis, which
PBIL is unable to model, this is not surprising. What is surprising is the even poorer
performance of the PPT model. PPT appears well-matched to the fitness function, yet
performs much worse than the naive PBIL model. PBIL is able to achieve fitnesses,
for some selection ratios, of around 3.4, whereas PPT never exceeds 2.7. The effects
are particularly marked around selection ratios from 10% through to 60%, with the
differences becoming weaker by 80% to 90%, and essentially disappearing at a 100%
selection ratio.

4.3  Performance  of  PPT 

Overall, we see poor performance from the PPT model for both simple and complex
problems. Even for the max problem - the kind of problem that PPT was designed to
solve - it shows much worse performance than PBIL. The behaviour under 100%
selection - i.e. pure sampling drift - suggests a possible cause: that sampling drift [17]
may be the major influence on performance. The poor performance on the trivial
fitness landscape of the one-max variant supports this. The good performance of GGGP
emphasizes just how damaging this effect is.

Fig. 2. Best Fitness vs Generation for Max (Structure: left, PBIL; right, PPT; percentage is
the selection ratio)


5  Analysis  of  the  PPT  Model 

5.1  The  Effects  of  Arity 

In a PPT, each node represents a random variable which can take any of the possible
instructions as its value.¹ Table 3 shows a typical example for the case of symbolic
regression, with a function set consisting of the four binary arithmetic operators, four
unary trigonometric and exponential operators, and a variable and constant, of arity 0.


Table 3. PPT Table for Symbolic Regression, Showing Arities

Instruction   Arity   Probability      Instruction   Arity   Probability
+             2       0.1              sin           1       0.1
×             2       0.1              cos           1       0.1
-             2       0.1              log           1       0.1
/             2       0.1              exp           1       0.1
X             0       0.1              C             0       0.1

The combining of nodes of different arities in the PPT model creates a dependence
relationship between parent and child nodes, even though their probability distributions
appear to be separate. If a node $n_1$ is sampled as sin, one of the child nodes -
conventionally $n_3$ - loses the opportunity to sample an instruction. Therefore the probability of
sampling $n_3$ is different from that of $n_2$, the other child node. Thus although the
probability distribution of $n_3$ is independent of the condition set of $n_1$, $n_3$ is nevertheless
dependent on the complete condition set of $n_1$, because the probability of sampling an
instruction for $n_3$ is 0 in the case where a unary function or variable is sampled at $n_1$.

¹ Nodes at the maximum depth are only permitted values of zero arity, but for the sake of
simplicity we omit this from consideration here.

To clarify this dependency, we transform the PPT probability distribution to a
semi-degenerate Bayesian network.²


5.2  Conversion  to  Semi-degenerate  Bayesian  Network 

Undefined Instruction. In the PPT, each node's probability table cannot be directly
treated as a random variable, because the probability distribution for some conditions
of the parent is not recorded in the table. To cover this case, where a node cannot select
any value, we define an additional value U, for 'undefined value'. Taking a simple case
with just three values, +, sin and C, an independent PPT might have probabilities of
0.4 for + and sin, and 0.2 for C. Taking account of the parent-child dependencies, we
could represent the overall conditional dependency of a random variable for a node
given its parent, as in figure 3. In the parent node of M4, any of +, sin, C or U might
be sampled. When C, constant, is sampled, M4 is not able to sample any value, so that
the probabilities for selecting +, sin and C are zero; to represent that no instruction can
be sampled in this condition, we allocate the 'undefined' instruction a probability of
1.0. If the parent node is sampled as 'undefined', M4 must also be undefined.


             Value of the parent of M4
M4           +       sin     C       U
+            0.4     0.4     0.0     0.0
sin          0.4     0.4     0.0     0.0
C            0.2     0.2     0.0     0.0
U            0.0     0.0     1.0     1.0

Fig. 3. Transformed Probability table of PPT
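
The construction of such a transformed table can be sketched in a few lines. This is a minimal illustration under assumptions: the three-instruction set and arities follow the example in the text, `child_position` (0 for the first child, 1 for the second) and the function name `transformed_table` are introduced here, and the result is indexed as `table[parent_value][child_value]`.

```python
ARITY = {"+": 2, "sin": 1, "C": 0}

def transformed_table(child_probs, child_position, arity=ARITY):
    """Conditional table P(child | parent) with the extra value 'U' (undefined)."""
    table = {}
    for parent in list(arity) + ["U"]:
        # The child exists only if the parent is defined and has enough arguments.
        defined = parent != "U" and arity[parent] > child_position
        if defined:
            row = dict(child_probs, U=0.0)
        else:
            row = {value: 0.0 for value in child_probs}
            row["U"] = 1.0
        table[parent] = row
    return table

m4 = transformed_table({"+": 0.4, "sin": 0.4, "C": 0.2}, child_position=0)
for parent, row in m4.items():
    print(parent, row)
```

Calling the same function with `child_position=1` yields a table in which a `sin` parent also forces the child to be undefined, which is the asymmetry between left and right children discussed in the text.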


Figure 4 shows more detail, illustrating how a simple three-node PPT can be
transformed into a (semi-degenerate) BN. Note that the probability structures of the left and
right children differ (because of the differing effects of the sin function in the parent).

5.3  Factorization  of  Full  Joint  Distribution 

Dependent Variable. In the resulting BN, the transformed nodes become conditionally
dependent on their parent nodes (there are only two exceptions - either the node is
always undefined, hence unreachable and may be omitted from the PPT, or else the
node is always defined, implying that the parent node cannot sample a terminal, an
unreasonable situation in GP - both may be safely ignored).

² In standard terminology, tables without zeros are said to be non-degenerate, and tables
containing only 0.0 and 1.0 are degenerate. We introduce the term 'semi-degenerate' for the
intermediate case, of tables containing 0.0 but not necessarily 1.0.

[Figure 4: a three-node PPT (root M1, left child M2 reached for parent values + and sin,
right child M3 reached for parent value +) and its transformed version.]

PPT (independent tables):

             M1      M2      M3
+            0.4     0.4     0.3
sin          0.4     0.4     0.3
C            0.2     0.2     0.4

Transformed PPT (columns give the value of the parent M1):

M2           +       sin     C
+            0.4     0.4     0.0
sin          0.4     0.4     0.0
C            0.2     0.2     0.0
U            0.0     0.0     1.0

M3           +       sin     C
+            0.3     0.0     0.0
sin          0.3     0.0     0.0
C            0.4     0.0     0.0
U            0.0     1.0     1.0

Fig. 4. Transformation from PPT to semi-degenerate BN

In the simplest PPT case, where each node's value is assumed probabilistically
independent of the other nodes, the only dependence is that arising above. That is, this
simple case corresponds to the assumption that each node is conditionally independent
of all other nodes in the PPT, conditioned only on its parents. Thus the probability
distribution of node $x$ can be represented by $p(x \mid \text{parent of } x)$, and the full joint probability
distribution of the transformed PPT as:

$$p(X) = \prod_i p(X_i \mid X_{\text{parent of } i}) \qquad (3)$$

Of course, more complex dependencies between PPT nodes may give rise to more
complex dependencies in the corresponding BN, but the dependence of the child on its
parents will always remain.
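
To make equation 3 concrete, the sketch below evaluates the joint distribution for a three-node example with root M1, first child M2 and second child M3, using probabilities 0.4/0.4/0.2 for M1 and M2 and 0.3/0.3/0.4 for M3 over +, sin and C. The helper names `cond` and `joint` are assumptions made for this illustration.

```python
from itertools import product

ARITY = {"+": 2, "sin": 1, "C": 0}
ROOT = {"+": 0.4, "sin": 0.4, "C": 0.2}     # M1
LEFT = {"+": 0.4, "sin": 0.4, "C": 0.2}     # M2, first child of M1
RIGHT = {"+": 0.3, "sin": 0.3, "C": 0.4}    # M3, second child of M1

def cond(child_probs, position, child, parent):
    """p(child | parent) in the transformed table, with 'U' for undefined."""
    defined = ARITY[parent] > position
    if child == "U":
        return 0.0 if defined else 1.0
    return child_probs[child] if defined else 0.0

def joint(m1, m2, m3):
    """Equation 3 for the three-node network: p(M1) p(M2 | M1) p(M3 | M1)."""
    return ROOT[m1] * cond(LEFT, 0, m2, m1) * cond(RIGHT, 1, m3, m1)

print(joint("+", "sin", "C"))    # 0.4 * 0.4 * 0.4
print(joint("sin", "C", "U"))    # 0.4 * 0.2 * 1.0
total = sum(joint(a, b, c) for a, b, c in product(ROOT, list(LEFT) + ["U"], list(RIGHT) + ["U"]))
print(total)                     # the factorisation sums to 1 (up to rounding)
```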

Sampling Bias. This factorization of the joint distribution gives us a way of
understanding the rapid diversity loss in PPT-based EDAs. In PLS sampling, for each
random variable, the sample size is the same in the transformed PPT. However, the actually
meaningful instructions exclude undefined instructions. The size of the sample actually
used to generate meaningful instructions reduces (exponentially) with depth. This is the
cause of the rapid diversity loss due to sampling drift: unlike other EDAs, in which
the sample size is the same across all variables, drift increases due to reduced sample
size with depth. Figure 5 shows the population (phenotype) entropy at each generation.
We only show the 100% selection ratio because, in that case, there is no diversity loss due
to selection: the whole loss is the result of sampling drift. In both problems, the loss
of diversity due to sampling drift is much greater in the PPT representation than in the
PBIL representation.
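
The exponential shrinkage of the effective sample can be illustrated numerically. The sketch assumes, purely for illustration, that every ancestor independently holds a binary function with the same probability (2/3 for a uniform table over +, × and 0.5) and that the root has depth 0; the function name `expected_defined` is hypothetical.

```python
def expected_defined(pop_size, p_function, depth):
    """Expected number of individuals in which a node at `depth` is defined.

    A node is defined only if all of its `depth` ancestors hold a function,
    which under the independence assumption has probability p_function ** depth.
    """
    return pop_size * p_function ** depth

for depth in range(4):
    print(depth, round(expected_defined(100, 2 / 3, depth), 1))
# prints: 0 100.0, 1 66.7, 2 44.4, 3 29.6 (one pair per line)
```

Under these assumptions, with a population of 100 only about 30 individuals contribute a meaningful value to each leaf table of a depth-3 tree, whereas a PBIL-style model would use all 100 at every location.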

Fig. 5. Entropy of Population vs Generation (Left: One-max Variant. Right: Max (ind:
independent - PBIL - structure))


6  Discussion 

The importance of these results lies not merely in their direct implications for this trivial
problem, but in their implications for PPT-based EDAs for GP. Compare these problems
with typical GP problems. The dependency depth is atypically small, corresponding to
a GP tree depth bound of only 3. The dependency branching is typical, or even slightly
below average, for GP. And of course, the fitness landscape is vastly simpler than most
GP problem domains. If this is so, why has EDA-GP been able to succeed, and even
demonstrate good performance on some typical GP problems? We believe it is due to
masking of the problem of accelerated drift under sampling in typical implementations.

These implementations generally incorporate mechanisms reducing the effect of
sampling drift: better selection strategies and model update mechanisms, adding elitism
and mutation all contribute to this reduction. In addition, our problem is tougher than
typical GP problems in one respect: there is only one solution (two for the max
problem). Most problem domains explored by GP have symmetries, so that eliminating a
solution may not stymie exploration. Thus EDA-GP has been able to work well for GP
test problems. However, the drift effect worsens exponentially with tree depth, while
these ameliorating mechanisms only scale linearly. Perhaps this is why EDA-GP has so
far been limited to demonstrations on test problems rather than practical applications.

Some previous PPT research, notably Hasegawa and Iba's POLE [9], incorporates
measures to ameliorate sampling drift using the Extended Parse Tree. Here, our focus
is to clarify the effect of accelerated drift due to PPT dependency, as a preliminary to
investigating solutions.

7  Conclusions 

Diversity loss due to sampling is a well-known problem in EDA research, and has been
carefully studied for independent probability models. It is well known that the problem
worsens in probabilistic dependency models, and some lower bounds for the effect
have already been found [17]. However, there does not appear to have been any previous
publication of the effects on PPT-based (branching) EDAs.
By studying the sampling drift effect of two structures, on a near-trivial optimisation
problem and another only slightly harder, we were able to see the importance of this
diversity loss. The effects are sufficient to cast doubt on the scalability of most current
approaches to EDA-GP. Can these problems be overcome? Can scalable EDA-GP
systems be built? We believe it to be possible, but not easy. Any remedy must counteract
the depth dependence of the drift. This probably eliminates variants of some of
the traditional methods. For example, it is difficult to see how to incorporate dependence
depth into population-based mechanisms such as elitism. Similarly, it doesn’t
seem easy to use mutation or similar mechanisms in a useful depth-dependent way. On
the other hand, it may be possible to incorporate depth-based mechanisms into model
update and/or sampling in ways that might be able to overcome the depth-dependence
of sampling drift, and so permit scaling.

In  the  near  future,  we  plan  to  extend  this  work  in  three  directions.  The  first,  already 
in  progress,  involves  experimental  measurement  of  diversity  loss  to  gauge  the  extent 
of  acceleration  of  the  sampling  drift  effect.  The  second,  in  prospect,  will  attempt  to 
mathematically  estimate  the  diversity  loss  through  sample  size  estimation.  The  third 
extends  this  work  to  grammar-based  GP  EDA  systems  (i.e.  those  not  based  on  PPTs). 
Similar  problems  of  accelerated  sampling  bias  occur  in  these  systems,  though  it  is  more 
difficult  to  isolate  clear  demonstrations  of  this. 
Abstract. The common use of null arguments is one of the most critical issues in
pro-drop languages. When translating Korean into other languages, the omitted
elements should be replaced with appropriate pronouns to get grammatical target
sentences. One of the most important issues when dealing with zero pronouns
is to determine the referentiality of zero pronouns. Since, like expletive ‘it’ in
English, omitted elements do not always have explicit referents, it is important
to determine whether a zero pronoun is referential or not. In this paper, we focus
on identifying non-referential zero pronouns. Since non-referential zero pronouns
are likely to occur in similar contexts, referentiality determination in this paper is
regarded as the identification of clauses containing non-referential zero pronouns.
Our method outperforms the baseline systems using n-grams and bag of words,
and achieves F-measures of 0.51 and 0.78.

Keywords: zero pronoun, ellipsis, referentiality, anaphoricity, parse tree kernel


1  Introduction 

In pro-drop languages such as Chinese, Japanese and Korean, it is important to identify
referents of missing elements which frequently occur in sentences. These omitted
elements are often called zero pronouns, and the resolution of zero pronouns is of
importance for various applications in natural language processing such as machine
translation, text summarization, information extraction, and so on.

Zero pronouns are divided into three groups according to the positions in which the
referents are understood: anaphora, cataphora and exophora [1]. That is, not all zero
pronouns have explicit referents in sentences. For that reason, recent work related
to reference resolution has attempted to determine the referentiality (or anaphoricity)
of nominal expressions including zero pronouns [11,12,17]. In the context of zero
pronoun resolution, referentiality determination is the task of judging whether a given zero
pronoun is referential or non-referential. If its explicit referent (or antecedent) is found
in the text, the zero pronoun is classified as referential (or anaphoric); otherwise, it is
classified as non-referential (or non-anaphoric). However, the performance of referentiality
determination for zero pronouns is not satisfactory enough, because it is difficult
(1) Cheolsu-NOM bet-COMP lose-PRED ($\phi_1$-NOM) get cross-PRED PUNC

    When Cheolsu loses a bet, _ [get] cross.

(2) ($\phi_2$-NOM) be about two o'clock-PRED sun-NOM behind the mountain-LOC set-PRED PUNC

    If _ [be] about two o'clock, the sun sets behind the mountain.

(3) skua-NOM the sea level-LOC close-MOD be flying-PRED ($\phi_3$-NOM) carefully-MOD watch-PAST-PRED
    ($\phi_4$-NOM) blue-eyed-MOD cormorant-OBJ be attacking-PAST-PRED PUNC

    A skua is flying close to the sea level, so _ [be] carefully watched, and _ [be] attacking a blue-eyed cormorant.


Fig. 1. An example of sentences with zero pronouns


to distinguish non-referential from referential uses of the same forms. Most previous
studies have regarded all cases in which the referents of zero pronouns fail to be identified as
non-referential. However, this is not an appropriate treatment of the referentiality of zero
pronouns, since there can be errors in referent identification for zero pronouns.

Figure 1 shows an example of sentences containing zero pronouns. In Figure 1, zero
pronouns $\phi_1$ and $\phi_4$ refer to ‘Cheolsu’ and ‘skua’ in the same sentence,
respectively. However, the referents of $\phi_2$ and $\phi_3$ do not appear in the text: $\phi_2$ is a
zero pronoun that refers to time, and $\phi_3$ refers to the speaker, who is a discourse
participant. In the translation of Korean to English, non-referential zero pronoun $\phi_2$
should be translated into ‘it’, and $\phi_3$ should be replaced with ‘I’ (or ‘we’). However, it
is difficult to obtain additional information such as gender, number and person during
translation, because the referents of non-referential $\phi_2$ and $\phi_3$ do not explicitly appear
in sentences. Also, in the case of referential zero pronouns, such information is not
always provided. Therefore, the referentiality of zero pronouns should be considered
before translating, and has been considered as one of the most important issues to be
addressed for practical applications like machine translation.

This paper proposes a method for identifying non-referential zero pronouns in sentences.
Previous studies have determined non-anaphoric cases through pairwise comparisons
between a zero pronoun and its antecedent candidates [11], and in most cases,
do not learn non-anaphoric cases from non-anaphoric training examples. Thus, the
referentiality of zero pronouns in previous work on evaluating the preference of antecedent
candidates is determined by parametric models or by methods of identifying
non-anaphoric cases indirectly [10,15,16]. In this paper, we attempt to identify
non-referential cases directly from non-anaphoric training instances. Since they are likely to
occur in similar contexts, the proposed model measures the syntactic similarity between
the contexts in which zero pronouns occur. To do this, structural information of clauses
is used for our experiments. In addition, the majority of zero pronouns occur in subject
grammatical positions. The rate of subject drop is approximately 94% in Korean, which
is significantly higher than the rate of zeros in other positions [12]. Therefore, this paper
focuses on determining the referentiality of subject zero pronouns. Referentiality determination
in this paper is regarded as the identification of clauses containing non-referential subject
zero pronouns. In our experiments, support vector machines with a parse tree kernel
[6,13] are used to examine the structural similarity between clauses.

The remainder of this paper is organized as follows. Section 2 surveys previous work
on zero pronouns. Section 3 proposes a machine learning method for identifying
non-referential zero pronouns. Section 4 presents experimental results, and the
conclusion is given in Section 5.


2  Related  Work 

Most studies on reference resolution including zero pronouns are broadly divided into
two groups. One is based on heuristic rules or theoretical approaches such as Centering
theory [2]. Centering theory provides a model of local coherence in discourse, and has
usually been used to resolve pronouns in English. However, it is difficult to deal with all
types of zero pronouns in the framework of Centering, since it is not easy to identify the
referentiality of zero pronouns in pro-drop languages which allow missing subjects, such
as Chinese, Japanese and Korean. Roh [14] proposed a cost-based centering model for
generating zero pronouns corresponding to anaphoric expressions in order to produce
a coherent text in Korean. However, there is a problem in applying this model directly
to zero pronoun resolution. In addition, the use of non-referential zero pronouns is not
considered in the revised centering model [14].

The other approach is based on machine learning methods [20]. Previous studies can
be reclassified according to whether or not anaphoricity (or referentiality) determination
is separated from antecedent identification. First, previous work focusing on antecedent
identification classifies noun phrases intervening between a zero pronoun and its referent,
or noun phrases which are not involved in coreference chains, as negative instances
[7,10]. These studies have regarded zero pronouns whose referents fail to be identified
as non-referential cases. However, it is not reasonable to conclude that zero pronouns
in such cases are all non-anaphoric, since there can be errors in the antecedent identification
model. Recent studies have attempted to determine the anaphoricity of zero pronouns
in a separate step [10,15,16]. For zero pronouns in Korean, Han [12] has attempted to
identify the referents according to anaphoric and non-anaphoric uses of zero pronouns.
However, the model proposed by Han [12] was designed based on morpho-syntactic
information. In addition, by trying to characterize anaphoric and non-anaphoric cases
in a similar manner, it did not provide evidence for anaphoricity determination.
Iida [15] has presented an anaphoricity determination model using syntactic patterns. The
importance of structural information extracted from parse trees has been shown in recent
work [15,18]. Iida [15] proposed a tournament-based model
which learns the relative preference between candidates. The most likely candidate
antecedent of a zero pronoun is selected through the tournament model, and the final
antecedent is determined by deciding whether the zero pronoun and the chosen candidate
antecedent are anaphoric. However, their model of anaphoricity determination is
parametric, and is built on the results of antecedent identification.

The concern of anaphoricity determination has also been expressed in English [8,17].
Recently, Bergsma [17] presented an approach to detecting non-referential pronouns
Type                       Example

(1) Deictic                a. (You) eat lunch.
                           b. (I) am happy that the Korean team won.

(2) General Situational    c. (It) is ten o'clock already.
                           d. Younghee explained regarding global warming.

(3) Indefinite Personal    e. If (one) wishes to catch a tiger, (one/he) must go to the mountains.

Fig. 2. An example of non-referential zero pronouns


in text based on the distribution of the pronoun’s context, in order to determine the
referentiality of the English pronoun ‘it’. However, in pro-drop languages that allow free
word order and frequent ellipsis of elements, the occurrence and referentiality of zero
pronouns should be more carefully considered.

Thus, determining the referentiality in the use of referring expressions is of importance
for reference resolution and many applications in natural language processing.
The referentiality of zero pronouns has emerged as an important issue in pro-drop
languages, but the performance of referentiality determination for zero pronouns is still not
satisfactory.

3  Identification  of  Non-referential  Zero  Pronouns 

3.1  Non-referential  Zero  Pronouns 

Zero pronouns that do not have explicit antecedents in the same text are regarded as
non-referential ones in this paper. From this view, exophoric zero pronouns [1] are also
treated as non-referential although they refer to something extralinguistic. Therefore, in
this paper, non-referential zero pronouns can be classified as follows [12].

(1) Deictic zero pronoun

(2)  General  situational  zero  pronoun 

(3)  Indefinite  personal  zero  pronoun 

Figure 2 shows non-referential uses of zero pronouns. Zero pronouns which refer to
discourse participants such as the speaker and the hearer are classified as deictic
reference in type (1), and zero pronouns in type (2) refer to time, weather, general situation
and so on. In addition, idiomatic expressions such as “regarding”, “according
to”, “for the sake of” and so on are also classified into type (2), as done in Han [12]. A
pronominal use of zero pronouns which refer to a generic person like “one” is found in
type (3). Sometimes, indefinite zero pronouns in type (3) can be used to refer to specific
entities which are not explicitly mentioned in the context. In this paper, the three types of
zero pronouns described above and zero pronouns with verbal or clause antecedents are
considered as non-referential.
C1    C2 = {$\phi_1$}    C3 = {$\phi_2$}

C1: A skua is flying close to the sea level

C2: ($\phi_1$) [be] carefully watched

C3: ($\phi_2$) [be] attacking a blue-eyed cormorant.

Fig. 3. The parse tree of a Korean sentence with zero pronouns


3.2  Identifying  Non-referential  Zero  Pronouns  Using  Structural  Information 

This paper focuses on subject zero pronouns, which have the highest frequency of occurrence
[12]. Unlike in languages such as Spanish and Italian, zero pronouns in languages such as
Chinese, Japanese and Korean are relatively free from morpho-syntactic restrictions. In
other words, the resolution of zero pronouns in Korean is not sufficiently supported by
rich agreement such as gender, number, and person. Unlike previous methods that rely on
measuring the preference between a zero pronoun and its antecedent candidates [15,19],
this paper uses structural information of clauses to identify non-referential uses of zero
pronouns. The identification of clauses containing non-referential zero pronouns has
the advantage of avoiding unnecessary comparisons between candidates. Since there
are no explicit referents in sentences, non-referential zero pronouns are not effectively
captured by comparing competing candidates. Therefore, the referentiality of zero
pronouns needs to be understood from a different perspective. In this paper, the determination
of the referentiality of zero pronouns is regarded as the identification of clauses with
non-referential zero pronouns.

Figure 3 shows an example of the parse tree of a sentence with zero pronouns. The
example sentence consists of three clauses, C1, C2, and C3. In Figure 3, zero pronoun
$\phi_1$ in clause C2 is non-referential and refers to a discourse participant. On the
other hand, the referent of $\phi_2$ in clause C3 is ‘skua’ in clause C1; that is, it is regarded
as referential. For our experiments on non-referential zero pronouns, the structure of
clause C2 is used as a positive instance in the training phase, and clauses C1 and C3
are used as negative examples. Thus, the proposed model directly learns non-referential
cases using non-referential training examples. A parse tree kernel is used in our method
for modeling syntactic information of clauses. We assume that missing subjects are
already detected in each clause, as in most studies on zero pronouns [15].



3.3  Support  Vector  Machine  with  Parse  Tree  Kernel 

The identification of clauses with non-referential subject zero pronouns can be considered
as a binary classification task. Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a set of
training examples where $y_i \in \{-1, +1\}$ and $x_i = c_i$. Here, each $c_i$ is a clause and $y_i$
is the class label associated with this training sample. The value $+1$ of $y_i$ implies that
there is a non-referential subject zero pronoun in clause $c_i$.

The identification of non-referential zero pronouns is to estimate a function $f : X \rightarrow Y$.
After the function $f$ parameterized by $\theta$ is trained with $D$, the predicted label
$y^*$ of an unlabeled example $x$ can be determined by

$$y^* = \arg\max_{y} \big(f(x, \theta) = y\big).$$

Since our task is a binary classification, support vector machines (SVMs) are adopted as
an implementation of the function $f$. The decision function of SVMs is defined by

$$y^* = \mathrm{sgn}\Big(\sum_{j \in SV} y_j \alpha_j \, \Phi(x_j) \cdot \Phi(x) + b\Big), \tag{1}$$

where $\Phi$ is a non-linear mapping function from $\mathbb{R}^N$ to $\mathbb{R}^M$ ($N \ll M$), $SV$ is a set of
support vectors, and $\alpha_j, b \in \mathbb{R}$, $\alpha_j > 0$. The mapping function $\Phi$ should be designed
such that all training examples are linearly separable in $\mathbb{R}^M$.

Since it is difficult to design an explicit form of $\Phi$, the inner product of $\Phi(x_j)$ and
$\Phi(x)$ is computed using a simple kernel such that

$$K(x_j, x) = \Phi(x_j) \cdot \Phi(x).$$

As a result, when a kernel $K_P$ is designed to compute the inner product between
clauses, Equation (1) is rewritten as

$$y^* = \mathrm{sgn}\Big(\sum_{j \in SV} y_j \alpha_j \, K_P(x_j, x) + b\Big). \tag{2}$$

In order to apply SVM to our task, a number of positive and negative examples used as
$D$ are generated.
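
The kernelized decision rule in Equation (2) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the SVM implementation used in the experiments: the function names `svm_decide` and `dot_kernel`, the toy support vectors, and the coefficient values are all hypothetical, and a plain dot product stands in for the clause kernel $K_P$.

```python
def dot_kernel(u, v):
    """Toy stand-in for the clause kernel K_P: a plain dot product (assumption)."""
    return sum(a * b for a, b in zip(u, v))


def svm_decide(x, support_vectors, alphas, labels, b, kernel):
    """Return +1 or -1 following y* = sgn(sum_j y_j * alpha_j * K(x_j, x) + b).

    A score of exactly zero is mapped to +1 here; that tie-breaking choice
    is an assumption of this sketch.
    """
    score = b + sum(
        y_j * a_j * kernel(x_j, x)
        for x_j, a_j, y_j in zip(support_vectors, alphas, labels)
    )
    return 1 if score >= 0 else -1


# Hypothetical support vectors, coefficients and labels, for illustration only.
SUPPORT_VECTORS = [(1.0, 0.0), (0.0, 1.0)]
ALPHAS = [0.5, 0.5]
LABELS = [+1, -1]
BIAS = 0.0

positive = svm_decide((2.0, 1.0), SUPPORT_VECTORS, ALPHAS, LABELS, BIAS, dot_kernel)
negative = svm_decide((0.0, 3.0), SUPPORT_VECTORS, ALPHAS, LABELS, BIAS, dot_kernel)
```

Because the kernel is passed in as a function, the same decision rule works unchanged when the dot product is swapped for a kernel defined over parse trees.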

A parse tree kernel is used to measure the syntactic similarity between clauses. The
parse tree kernel is a specialized convolution kernel introduced by Haussler [3] and
efficiently reflects structural information [6,13]. In the vector representation of a parse
tree, the features correspond to the subtrees that can possibly appear in the parse tree.
The value of a feature is the frequency of the corresponding subtree in the parse tree.
The inner product of the vector representations of two trees, $T_1$ and $T_2$, is computed
using the following equation [6].

$$
\begin{aligned}
\langle V_{T_1}, V_{T_2} \rangle
&= \sum_{i} \#st_i(T_1) \cdot \#st_i(T_2) \\
&= \sum_{i} \Big( \sum_{n_1 \in N_{T_1}} I_{st_i}(n_1) \Big) \cdot \Big( \sum_{n_2 \in N_{T_2}} I_{st_i}(n_2) \Big) \\
&= \sum_{n_1 \in N_{T_1}} \sum_{n_2 \in N_{T_2}} C(n_1, n_2)
\end{aligned}
\tag{3}
$$

where $\#st_i(T)$ is the frequency of a subtree $st_i$ in $T$, and $N_{T_1}$ and $N_{T_2}$ are the sets of
nodes in $T_1$ and $T_2$ respectively. $I_{st_i}(n_1)$ is a function that returns the frequency of $st_i$
rooted at $n_1$ in $T_1$, and $C(n_1, n_2)$ is the sum of the product of the numbers of times
each subtree appears at $n_1$ and $n_2$.
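
As an illustration of Equation (3), the following Python sketch computes the last line of the equation by summing $C(n_1, n_2)$ over all node pairs. The text above does not spell out how $C$ is evaluated, so the recursion used here is an assumption: it is the recursion commonly used for convolution tree kernels, without a decay factor. The tree encoding (nested tuples with words as strings) and all function names are hypothetical.

```python
from itertools import product


def is_leaf(node):
    """Words are plain strings; internal nodes are tuples (label, child, ...)."""
    return isinstance(node, str)


def production(node):
    """Grammar production at a node: its label plus the tagged labels of its children."""
    label, *children = node
    rhs = tuple(("word", c) if is_leaf(c) else ("node", c[0]) for c in children)
    return label, rhs


def internal_nodes(tree):
    """Yield every internal (non-word) node of the tree."""
    if is_leaf(tree):
        return
    yield tree
    for child in tree[1:]:
        yield from internal_nodes(child)


def common_subtrees(n1, n2):
    """C(n1, n2): number of common subtrees rooted at both n1 and n2 (assumed recursion)."""
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        # Matching productions guarantee c1 and c2 are both words or both nodes.
        if not is_leaf(c1):
            count *= 1 + common_subtrees(c1, c2)
    return count


def tree_kernel(t1, t2):
    """Inner product of the subtree-count vectors, as in the last line of Equation (3)."""
    return sum(
        common_subtrees(a, b)
        for a, b in product(internal_nodes(t1), internal_nodes(t2))
    )


# Toy English trees, for illustration only.
T_A = ("S", ("NP", "skua"), ("VP", "flies"))
T_B = ("S", ("NP", "cormorant"), ("VP", "flies"))
```

With these toy trees, `tree_kernel(T_A, T_A)` counts the six subtree fragments of `T_A`, while `tree_kernel(T_A, T_B)` counts only the three fragments that do not contain the differing noun.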

4  Experimental  Results  and  Analysis 

4.1  Dataset 

For our experiments, the parsed corpus which is a product of the STEP 2000 project
supported by the Korean government is used. We first manually identified subject zero
pronouns in the parsed corpus, and then extracted the complex compound sentences with
one or more subject zero pronouns from the parsed corpus. Simple statistics
on the dataset are given in Table 1. The number of selected sentences is 5,221 and the
sentences are segmented into 20,748 clauses (on average, 3.97 clauses/sentence and
7.67 words/clause).

Table 1. Simple statistics on the dataset used in our experiments

                                               Number
Sentences                                       5,221
Clauses                                        20,748
Clauses in which subject zero pronouns occur   13,171

Table 2 shows the distribution of subject zero pronouns observed in our dataset.
In Korean, a large proportion of zero pronouns can be resolved in the same sentences
in which they occur, as shown in Table 2. However, the number of extra-sentential zero
pronouns, which correspond to non-referential ones, is also not small. Extra-sentential ones in
our dataset make up 76% of non-intrasentential ones. Therefore, it is important to
distinguish non-referential ones from the sentences in which zero pronouns occur. Since
their explicit referents exist neither within nor between sentences, it will be effective
to deal with the non-referential use of zero pronouns at the sentence level. This paper
focuses on identifying non-referential zero pronouns in the context of clauses.

Table 2. The distribution of subject zero pronouns observed in our dataset

Intra-sentential    Inter-sentential    Extra-sentential
10,371 (78.74%)     666 (5.06%)         2,134 (16.20%)
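
The figures quoted in Tables 1 and 2 can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the reported counts; the variable names are hypothetical and no new data are introduced.

```python
# Counts reported in Table 1 and Table 2.
sentences, clauses = 5221, 20748
intra, inter, extra = 10371, 666, 2134

# Total number of subject zero pronouns; Table 1 lists 13,171 clauses containing them.
total_zero = intra + inter + extra

clauses_per_sentence = round(clauses / sentences, 2)

# Percentage shares of each category, as listed in Table 2.
share_intra = round(100 * intra / total_zero, 2)
share_inter = round(100 * inter / total_zero, 2)
share_extra = round(100 * extra / total_zero, 2)

# Share of extra-sentential cases among the non-intra-sentential ones
# (the text quotes 76%).
extra_among_non_intra = round(100 * extra / (inter + extra), 2)
```

Under these counts the three categories sum to the 13,171 clauses listed in Table 1, and the extra-sentential share among non-intra-sentential cases comes to about 76%, in agreement with the text.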

4.2  Experimental  Results  and  Analysis  for  Referentiality  Determination 

Our experiments are performed in five-fold cross validation and SVMlight [4] is used as
the classifier. The accuracy and F-measure are used to evaluate the results of the identification
of non-referential zero pronouns, and these are calculated as follows. The balanced
F-score, which is the harmonic mean of recall and precision, is used in our experiments.
The results are shown in Table 3.

$$\text{Accuracy} = \frac{\text{number of correctly classified clauses}}{\text{total number of clauses}}$$

$$\text{Precision} = \frac{\text{number of correctly identified non-referential zero pronouns}}{\text{number of identified non-referential zero pronouns}}$$

$$\text{Recall} = \frac{\text{number of correctly identified non-referential zero pronouns}}{\text{number of true non-referential zero pronouns}}$$
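
The measures defined above can be sketched in Python as follows. The function names and the confusion-matrix counts in the example are hypothetical and chosen only for illustration; the balanced F-score is computed as the harmonic mean of precision and recall, as stated in the text.

```python
def accuracy(tp, tn, fp, fn):
    """Correctly classified clauses divided by all clauses."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Correctly identified non-referential cases divided by all identified ones."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Correctly identified non-referential cases divided by all true ones."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Balanced F-score: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts, for illustration only (not taken from the experiments).
tp, tn, fp, fn = 30, 140, 10, 20
example_accuracy = accuracy(tp, tn, fp, fn)        # 170 / 200
example_precision = precision(tp, fp)              # 30 / 40
example_recall = recall(tp, fn)                    # 30 / 50
example_f = f_measure(example_precision, example_recall)

# Harmonic mean of a precision of 79.30 and a recall of 78.00 (both in percent).
f_from_percentages = round(f_measure(79.30, 78.00), 2)
```

Note that an F-score averaged over cross-validation folds need not equal the harmonic mean of the averaged precision and recall, so small differences from reported values are possible.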

Table 3. The performance of referentiality determination of subject zero pronouns

Model             Accuracy    F-measure
Voting            88.55%      -
N-grams (n=6)     90.06%      6.92
BOW clause        92.12%      40.19
STRUC             92.12%      42.13
STRUC+ (r=0.8)    92.17%      51.09

To investigate the effect of structural information in this study, ‘N-grams’ and ‘BOW’
models are used as baseline systems. The ‘N-grams’ model is based on the context surrounding
zero pronouns regardless of the division of clauses. In this paper, three words preceding
and following zero pronouns are extracted as features of the ‘N-grams’ model. It can
be viewed as a simplified version of the model introduced by Bergsma [17]. In the ‘Voting’
model, the final classification decision is taken by a simple majority vote. When
the majority agree, it is classified as ‘positive’, and the accuracy is 88.55%. However,
since this leads to the result that the identification of non-referential zero pronouns is
not performed, further research is needed. In the bag of words (BOW) model, a clause is
represented as an unordered collection of words. As shown in Table 3, the ‘BOW’ model based
on clauses outperforms the ‘N-grams’ model. In particular, the context size of ‘N-grams’ is
similar to the average length of clauses in the ‘BOW’ model, but the recall of ‘N-grams’ is quite
low. This implies that information obtained from the unit of clauses is useful in identifying
non-referential zero pronouns. ‘STRUC’ and ‘STRUC+’ are the models using structural
information of clauses proposed in this paper. Here, ‘STRUC’ uses syntactic features
obtained from the parse tree of clauses, and the ‘STRUC+’ model combines the syntactic
features with a set of features extracted from words which occur in clauses, similarly
to the BOW model. Thus, the composite kernel $K$ for identifying non-referential zero
pronouns is then

$$K = r \cdot K_1 + (1 - r) \cdot K_2,$$

where $r$ ($0 \le r \le 1$) is a mixing parameter, and $K_1$ and $K_2$ are a parse tree kernel and a
polynomial kernel with degree 3, respectively. In our experiments, the parameter $r$ is set
to 0.8 empirically, as shown in Figure 4. The fact that the performance with larger $r$ is
superior to that with small $r$ implies that syntactic information is more positively related
to the identification of non-referential zero pronouns. Although this paper focuses on
investigating the effect of structural information in the identification of non-referential
zero pronouns, overall performance is expected to be much better if a composite kernel using
both structural information and semantic information is used in the future.
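
The composite kernel can be sketched in Python as a weighted sum of two kernel functions. This is a minimal sketch under stated assumptions: `toy_structural_kernel` is a hypothetical stand-in for the parse tree kernel $K_1$, and the polynomial kernel is assumed to take the common form $(u \cdot v + 1)^3$, since the text specifies only its degree. The pairing of a clause's structure with a word-count vector is likewise an assumed encoding.

```python
def toy_structural_kernel(labels_a, labels_b):
    """Hypothetical stand-in for the parse tree kernel K1: number of shared node labels."""
    return float(len(set(labels_a) & set(labels_b)))


def polynomial_kernel(u, v, degree=3, coef0=1.0):
    """Polynomial kernel (u . v + coef0) ** degree; coef0 = 1 is an assumption."""
    return (sum(a * b for a, b in zip(u, v)) + coef0) ** degree


def composite_kernel(k1, k2, r):
    """Build K = r * K1 + (1 - r) * K2 for a mixing parameter r in [0, 1]."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("mixing parameter r must lie in [0, 1]")

    def kernel(x, y):
        # Each example is assumed to be a pair (structure, word_vector).
        (struct_x, words_x), (struct_y, words_y) = x, y
        return r * k1(struct_x, struct_y) + (1.0 - r) * k2(words_x, words_y)

    return kernel


# Two hypothetical clauses: node labels plus a small bag-of-words count vector.
CLAUSE_A = (("S", "NP", "VP"), (1, 0, 1))
CLAUSE_B = (("S", "NP", "VP"), (1, 1, 0))

mixed = composite_kernel(toy_structural_kernel, polynomial_kernel, r=0.8)
mixed_value = mixed(CLAUSE_A, CLAUSE_B)
```

Setting `r=1.0` reduces the composite to the structural kernel alone and `r=0.0` to the word-based kernel alone, which corresponds to the two ends of the range of the mixing parameter.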

As shown previously, our method outperforms the baseline systems.
However, while the accuracy of the proposed models is quite high, the performance in
terms of the F-measure is not satisfactory. This may be related to the problem of
imbalanced data sets. In our dataset, the number of negative samples is approximately
nine times larger than that of positive ones.
A classifier induced from an imbalanced data set typically has a low error rate for the
majority class and an unacceptable error rate for the minority class. In this situation, it is
important to accurately classify the minority class in order to reduce the overall cost. To
address these problems, several methods can be considered, such as reweighting,
undersampling, and resampling [5,9]. In this paper, random under-sampling is considered,
which involves under-sampling the majority class samples at random until their
number matches the number of minority class samples. The results of sampling are
shown in Table 4 and Figure 4. In our method, the precision and recall after sampling
are 79.30% and 78.00%, respectively. This shows that the problem of imbalanced data
sets is significant in the identification of non-referential zero pronouns. In the future,
this study will investigate the use of ensemble methods such as bagging and boosting
to deal with imbalanced data.
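
Random under-sampling as described above can be sketched in Python as follows. This is a minimal sketch, not the procedure used in the experiments: the function name `random_undersample`, the fixed seed, and the toy data with a 9:1 class ratio are assumptions made for illustration.

```python
import random


def random_undersample(examples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size.

    Labels are assumed to be +1 (positive) or -1 (negative). The original
    order of the retained samples is preserved.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == +1]
    negatives = [i for i, y in enumerate(labels) if y == -1]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [examples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 10 positive and 90 negative samples (roughly the 1:9 ratio in the text).
toy_examples = list(range(100))
toy_labels = [+1] * 10 + [-1] * 90

balanced_examples, balanced_labels = random_undersample(toy_examples, toy_labels)
```

After sampling, the toy set contains ten samples of each class, and every positive sample is retained.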

Table 4. A performance comparison of sampling in the identification of non-referential zero
pronouns using structural information (r=0.8)

                   Accuracy    Precision    Recall    F-measure
Before sampling    92.17       87.45        39.71     51.09
After sampling     78.81       79.30        78.00     78.64

(a) Before sampling    (b) After sampling    [horizontal axis in both panels: the value of mixing parameter r, from 0.0 to 1.0]

Fig. 4. A comparison of performance before and after sampling
4.3 An Application to Identification of Subject Shareness

Before applying our method to machine translation, this paper attempts to investigate the
effect of referentiality in zero pronoun resolution. Frequent omissions of subjects in
Korean sentences imply that several predicates can share one subject. This is related to the
subject-sharing problem of clauses [18]. When identifying antecedents of omitted subjects
in intra-sentential resolution, it is necessary to determine whether their antecedents
exist in the same sentences. In this situation, in order to investigate how referentiality
determination affects the subject shareness problem, this paper applies referentiality
determination to the model proposed by Kim [18]. If non-referential zero pronouns
identified correctly by referentiality determination are excluded before antecedent
identification, the performance of the identification of subject shareness may be improved.

Table 5. The effect of referentiality determination in subject shareness identification (SSI)

                     Accuracy   Precision   Recall   F-measure
SSI                  76.34      69.55       61.58    65.30
Referentiality+SSI   76.81      70.52       68.64    69.56

Table 5 shows the results of subject shareness identification. These results indicate that
referentiality determination can play a positive role in the model of subject shareness.
Therefore, it will be very useful for zero pronoun resolution and for practical applications
such as machine translation if the performance of referentiality determination becomes more stable.

5  Conclusion 

Referential expressions including zero pronouns commonly occur in texts. The identification
of the objects they refer to is an important research area in natural language
understanding. Like the expletive pronouns ‘it’ or ‘there’ in English, zero pronouns do not
always refer to objects which explicitly occur in texts. Therefore, it is important to
distinguish the non-referential uses among the zero pronouns that are frequent in
pro-drop languages.

This paper focuses on identifying non-referential subject zero pronouns in Korean
sentences. The proposed model learns structural information of clauses, and directly
identifies non-referential uses using non-referential training instances. Our experimental
results show that information about clauses is important for identifying non-referential zero
pronouns. Our method outperforms the baseline systems, and the obtained results show
that structural information of clauses plays a positive role in solving our task.

In the future, we plan to apply the proposed method to a practical Korean-English
machine translation system. In addition, future work is needed to develop more advanced
methods to determine referentiality from imbalanced data.
Abstract. Previous efforts to identify idiomatic expressions using a bilingual
parallel corpus have focused on using word alignments
to catch the sense of individual words. In this paper, we propose a method
of using phrase alignments rather than word alignments in a parallel corpus
to recognize the sense of phrases as well as words. Our proposed scoring
functions are based on the difference of translation tendency between
a phrase and its individual words. They can help us identify idiomatic
expressions with an entropy variation and a translation difference between a
phrase and its individual words. Experimental results show that our proposed
method is more effective than previous approaches for the identification of
idiomatic expressions. In addition, we show that linguistic constraints
can be integrated into our method to improve the performance.


1  Introduction 

An idiomatic expression is often defined as a sequence of words which has a
different meaning from the composition of the meanings of its individual words,
although it is difficult to find a universal definition that covers all kinds of typical
idioms such as “kick the bucket” and “give up” [1]. In this paper, we regard
idiomatic expressions as non-compositional expressions in the same manner as
some previous works on the identification of idiomatic expressions [1,2,3].

Identifying  idiomatic  expressions  is  invaluable  for  natural  language  processing 
applications  such  as  machine  translation,  information  retrieval,  and  so  on.  Most 
rule-based  machine  translation  systems  generally  translate  idiomatic  expressions 
prior  to  the  word-for-word  translation  step  in  order  to  keep  the  adequacy  in  the 
first  step.  It  is  necessary  to  identify  idiomatic  expressions  in  a  user  query  to 
improve  the  effect  of  the  query  expansion  in  information  retrieval.  Moreover, 
idiomatic expressions can be used as a significant unit when documents are
indexed  by  terms. 

Our  task  can  be  summarized  as  follows: 

Input:  A  sequence  which  contains  two  or  more  words. 

Output: A score that shows how idiomatic or non-compositional the input is.
An expression with a high score is more idiomatic than one with a low score.
This definition is the same as that of the task carried out in [3]. We are interested
in scoring how close a word sequence is to an idiomatic expression.

Most previous efforts have used statistical information from a corpus to
identify idiomatic expressions. They are classified into two groups by the corpus
type, which is either a monolingual or a bilingual corpus. To date, the
approaches using monolingual corpora [1,4] are much more prevalent than efforts
using bilingual corpora [3,5] due to the convenience of collecting the corpora.

However, statistical machine translation has been receiving increasing attention
over the last decade and has led to the production of bilingual parallel
corpora in various language pairs. For this reason, exploring
bilingual parallel corpora has become an interesting topic for researchers in order
to extract useful knowledge such as paraphrases [6] and bilingual or multi-lingual
dictionaries [7,8].

The motivation for using a bilingual corpus rather than a monolingual corpus for
idiomatic expression identification is as follows. By translating a multi-word
expression, we can easily test whether it is an idiomatic expression or not. A
word may be translated differently according to the idiomatic expressions it
occurs in. If we cannot easily translate the combination word by word (with the
default translation¹), then that is strong evidence of an idiomatic expression.
Nevertheless, there is not much work on identifying idiomatic expressions using
bilingual parallel corpora.

The previous approaches [3,9] using bilingual corpora measured the translational
entropy or the proportion of default translation of the individual words in a
given expression in order to rank candidate expressions and to identify idiomatic
expressions.

Although they have shown some promising results, there are two limitations to
using only word alignments. Firstly, the methods using word alignments can generate
errors in the process of calculating the translational entropy of a word
or of extracting default translations of a word. A source word may be translated
into more than one target word (one-to-many alignment) as well as exactly one
word (one-to-one alignment). The word-based methods therefore
measure the translational entropy imprecisely or extract the default word
translation incorrectly, because a one-to-many alignment is regarded as multiple
one-to-one word alignments rather than a single one-to-one phrase alignment.
Secondly, the previous methods do not consider phrase-level translations;
they inspect only the word-level translation of expressions using word
alignments. For identifying idiomatic expressions, we assume that it is important
to analyze the difference of the translation tendency between a phrase and the
individual words in the phrase, which is not considered in previous approaches.

In this paper, we propose a method of using phrase alignments rather than
word alignments in a parallel corpus to identify idiomatic expressions. In order
to identify idiomatic expressions more precisely, we propose:

- examining a method of using phrase alignments instead of word alignments
- calculating the idiomatic expression score by new scoring functions based on
  the phrase alignments

¹ The default translation of a word or a phrase means the most typical translation
into the target language.

The rest of this paper is structured as follows. In section 2, we propose our novel
scoring functions for identifying idiomatic expressions and the method for phrase
alignment. We then evaluate the proposed method and analyze the results
in section 3. We conclude the paper and outline future work in section 4.

2 Phrase-Alignment Based Idiomatic Expression Identification

In  this  section,  we  present  the  intuitions  of  our  method  and  proposed  scoring 
functions  to  identify  idiomatic  expressions  using  a  bilingual  parallel  corpus. 

2.1  Finding  Phrase  Alignment 

It is necessary to extract not only word-based properties but also phrase-based
properties from a corpus for identifying idiomatic expressions, because they are
phrases, i.e. sequences of two or more words. We propose a method of using
phrase alignments for identifying these expressions in a bilingual parallel corpus.
The phrase alignments provide useful statistics for predicting the translation
tendency of a phrase.

Phrase alignment has been widely studied in the area of statistical
machine translation [10,11,12,13,14]. It aims to link a source phrase to a
target phrase which is likely to be the translation of the source phrase in a
given parallel sentence. Fig. 1 shows examples of word alignments and their
phrase alignment. The small black boxes in the alignment table indicate word
alignments in an English-Korean sentence pair, “John kicked the bucket” and
“존이 세상을 떠났다 (john-i se-sang-eul ddeo-nat-da)”. The large quadrangle
including three word alignment boxes shows a phrase alignment in the right-side
table. The English phrase “kicked the bucket” is aligned with the Korean phrase
“세상을 떠났다 (se-sang-eul ddeo-nat-da)” at the phrase level. These phrase-level
approaches led to great improvement in statistical machine translation.

Fig. 1. Examples of Word Alignment and Phrase Alignment

We adopt statistical phrase-based translation [10] to find phrase alignments
in a bilingual parallel corpus. Although there are a variety of phrase alignment
techniques, we use the method proposed by Och and Ney [11]. It
is the most popular method and extracts all aligned phrase pairs from the word
alignment result. Phrase alignments by this method include one-to-one,
one-to-many, many-to-one, and many-to-many word alignments.
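
The extraction of aligned phrase pairs from a word alignment can be sketched as follows. This is a simplified illustration of alignment-consistent phrase extraction, not the exact procedure of [11]: it omits the extension over unaligned boundary words, and the function name `extract_phrase_pairs`, the `max_len` limit, and the word links in the example are assumptions made for illustration only.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    links: Set[Tuple[int, int]],
    max_len: int = 4,
) -> List[Tuple[str, str]]:
    """Return phrase pairs consistent with the word links (src_idx, tgt_idx)."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any source word in [i1, i2].
            linked = [j for (i, j) in links if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word in [j1, j2] links outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in links):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = ["john", "kicked", "the", "bucket"]
tgt = ["john-i", "se-sang-eul", "ddeo-nat-da"]
# Hypothetical word links (source index, target index), assumed for this sketch.
links = {(0, 0), (1, 2), (2, 1), (3, 1)}
pairs = extract_phrase_pairs(src, tgt, links)
```

With these assumed links, the pair ("kicked the bucket", "se-sang-eul ddeo-nat-da") is extracted as a single phrase alignment, while "the" alone is not, because its target word is also linked to "bucket".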

2.2  Scoring  Idiomatic  Expression 

We  propose  two  novel  scoring  functions  based  on  phrase  alignments  and  the 
combination  method  of  two  functions.  The  functions  commonly  output  the  score 
which  shows  the  degree  of  the  closeness  to  idiomatic  expressions,  given  a  phrase 
as  input. 

DTE: Decrement of Translational Entropy. An idiomatic expression is
a phrase which has a meaning that cannot be derived by decomposing it into
its words. The translation of an idiomatic phrase tends to be limited to only
a few target phrases, even though each word in the phrase may be translated
as various words or phrases in the corpus. For example, Korean translations
of the English phrase “lie down” are largely restricted to “눕다 (nup-da)” or
“드러눕다 (deu-reo-nup-da)”, while Korean translations of the word “lie” or
“down” are various and evenly distributed.

Therefore, it is important to investigate the decrement of the translational
entropy when individual words are grouped together as a phrase. In other words, if the
average translational entropy of the individual words is high and the translational
entropy of the phrase containing them is low, the phrase is more likely to be an
idiomatic expression. The following equation reflects this idea.

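\[ \mathrm{Score}_{DTE}(p) = \frac{1}{2}\left( \frac{1}{|W_p|} \sum_{w \in W_p} H(T_w \mid w) + \bigl(1 - H(T_p \mid p)\bigr) \right) \tag{1} \]

where $W_p$ is the set of words in the phrase $p$, $T_p$ is the set of phrases aligned
with $p$, and $H(T_p \mid p)$ is the translational entropy [9] of $p$, which is calculated by the
following equation ($H(T_w \mid w)$ is defined analogously for a word $w$):

\[ H(T_p \mid p) = - \sum_{t \in T_p} P(t \mid p) \log P(t \mid p) \tag{2} \]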

We select the base of the logarithm according to the size of $T_p$ to normalize the
entropy into a value between 0 and 1. This normalization allows the entropy
to be comparable and $\mathrm{Score}_{DTE}()$ to return a value between 0 and 1. $P(t \mid p)$ is
identical to the phrase translation probability estimated by the relative frequency
of phrase pairs in statistical phrase-based translation [10].


\[ P(t \mid p) = \frac{\mathrm{count}(t, p)}{\sum_{t'} \mathrm{count}(t', p)} \tag{3} \]
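
Equations (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the convention that a phrase with a single observed translation has entropy 0, and all counts in the example are hypothetical.

```python
import math
from typing import Dict, List


def translation_probs(counts: Dict[str, int]) -> Dict[str, float]:
    """Relative-frequency estimate of P(t | p) from phrase-pair counts (Eq. 3)."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def normalized_entropy(counts: Dict[str, int]) -> float:
    """Translational entropy with log base |T_p|, so the value lies in [0, 1] (Eq. 2)."""
    probs = translation_probs(counts)
    if len(probs) < 2:
        return 0.0  # assumption: a single observed translation carries no uncertainty
    h = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return h / math.log(len(probs))  # dividing by ln|T_p| equals using base |T_p|


def score_dte(phrase_counts: Dict[str, int], word_counts: List[Dict[str, int]]) -> float:
    """Average word entropy plus (1 - phrase entropy), halved (Eq. 1)."""
    avg_word_entropy = sum(normalized_entropy(c) for c in word_counts) / len(word_counts)
    return 0.5 * (avg_word_entropy + (1.0 - normalized_entropy(phrase_counts)))


# Hypothetical counts: the phrase has one dominant translation,
# while each of its words is translated evenly in several ways.
phrase = {"nup-da": 9, "deu-reo-nup-da": 1}
words = [{"a": 5, "b": 5, "c": 5}, {"d": 4, "e": 4}]
score = score_dte(phrase, words)
```

In this toy case the word entropies are maximal and the phrase entropy is low, so the score is well above 0.5, which is the behaviour the DTE function is designed to reward.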


For example, the scores of the literal phrase “tv drama” and the idiomatic phrase
“new york” are calculated as follows. These examples show that our first function
helps us distinguish idiomatic phrases from literal phrases.

\[ \mathrm{Score}_{DTE}(\text{“tv drama”}) = \frac{1}{2}\left( \frac{0.28 + 0.18}{2} + (1 - 0.73) \right) = 0.32 \tag{4} \]

\[ \mathrm{Score}_{DTE}(\text{“new york”}) = \frac{1}{2}\left( \frac{H(T_{\text{new}} \mid \text{new}) + H(T_{\text{york}} \mid \text{york})}{2} + (1 - 0.19) \right) = 0.72 \tag{5} \]


DTW: Difference of Translated Words. In the second scoring function, we
use the default phrase translations of words or phrases to recognize their
meaning. A source phrase is most likely to be translated into its default phrase
translation. For instance, the English phrase “give up” has the Korean default
phrase translation “포기하다 (po-gi-ha-da)”, whose meaning is “to stop trying to
do something”.

We assume that there is a larger translational difference between the phrase
and its individual words in an idiomatic phrase than in a literal phrase. The
difference can be found by inspecting the default word translations and the default phrase
translations. The following equation is the scoring function for quantifying the
difference.


\[ \mathrm{Score}_{DTW}(p) = 1 - \frac{\left| W_{D_p} \cap \bigcup_{w \in W_p} W_{D_w} \right|}{\left| W_{D_p} \right|} \tag{6} \]


where $D_p$ is the set of default phrase translations of the phrase $p$, i.e. the N-best
translations of $p$, and $D_w$ is likewise the set of N-best translations of the word $w$. The optimal
N is obtained empirically by experiment. $W_p$ is the set of words in $p$, as in the
preceding function. As the following equation shows, $W_{D_p}$ and $W_{D_w}$ denote the sets of all
words in $D_p$ and $D_w$, respectively.


\[ W_{D_p} = \bigcup_{d \in D_p} W_d \tag{7} \]


The denominator of equation 6 is the number of words in the default translations
of the phrase p. The numerator is the number of words which occur in both the
default translations of p and the default translations of the individual words. If
the fraction is large, there are few differences between them. This indicates that
p is close to a literal expression. We subtract the fraction from 1 to give high
scores to idiomatic phrases.
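
The DTW computation can be sketched as follows. This is an illustrative reading of equations 6 and 7, not the authors' code: it assumes that words are obtained by whitespace tokenization, that word sets count distinct word types, and that a phrase with no default translation scores 0; the function names and the example translation sets are hypothetical.

```python
from typing import Iterable, List, Set


def words_of(translations: Iterable[str]) -> Set[str]:
    """Set of all words occurring in a collection of translations (Eq. 7)."""
    return {w for t in translations for w in t.split()}


def score_dtw(phrase_defaults: List[str], word_defaults: List[List[str]]) -> float:
    """1 minus the share of phrase-translation words also found in word translations (Eq. 6)."""
    w_dp = words_of(phrase_defaults)
    w_dw: Set[str] = set()
    for defaults in word_defaults:
        w_dw |= words_of(defaults)
    if not w_dp:
        return 0.0  # assumption: no default translation gives no evidence of idiomaticity
    return 1.0 - len(w_dp & w_dw) / len(w_dp)


# Literal phrase: its translations reuse the translations of its words.
literal = score_dtw(
    ["deu-ra-ma", "tv deu-ra-ma"],
    [["tv", "tel-le-bi-jeon"], ["deu-ra-ma", "sa-geuk"]],
)

# Idiomatic phrase: the word-level translations below are hypothetical.
idiomatic = score_dtw(
    ["po-gi-ha-da"],
    [["ju-da", "je-gong-ha-da"], ["wi-ro", "ol-li-da"]],
)
```

The literal phrase scores 0 because every word of its default translations is covered by the word-level translations, while the idiomatic phrase scores 1 because none is.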

The  intuition  of  this  scoring  function  is  similar  to  that  of  the  proportion  of 
default  alignment  (PDA)  proposed  by  previous  work  [3].  However,  we  directly 
extract  default  phrase  translations  using  phrase  alignments. 
For example, the scores of the literal phrase “tv drama” and the idiomatic
phrase “take charge of” are calculated as follows. These examples show that our
second function also helps us distinguish idiomatic phrases from literal phrases.
We assume that N is set to 2 in this example.

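\[ D_{\text{tv}} = \{\text{tv},\ \text{tel-le-bi-jeon}\}, \quad D_{\text{drama}} = \{\text{deu-ra-ma},\ \text{sa-geuk}\}, \quad D_{\text{tv drama}} = \{\text{deu-ra-ma},\ \text{tv deu-ra-ma}\} \tag{8} \]

\[ \mathrm{Score}_{DTW}(\text{tv drama}) = 1 - \frac{3}{3} = 0.00 \tag{9} \]

\[ D_{\text{take}} = \{\text{chwi-ha-da},\ \text{ha-da}\}, \quad D_{\text{charge}} = \{\text{hyeom-eui},\ \text{go-}\ldots\}, \quad D_{\text{of}} = \{\text{eui},\ \text{e dae-han}\}, \quad D_{\text{take charge of}} = \{\text{reul mat},\ \text{mat}\} \tag{10} \]

\[ \mathrm{Score}_{DTW}(\text{take charge of}) = 1 - \frac{0}{|W_{D_p}|} = 1.00 \tag{11} \]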

We derive the final scoring function, in which the two proposed functions are combined
linearly as follows. The parameter $\lambda$ is estimated empirically.

\[ \mathrm{Score}_{comb}(p) = \lambda\, \mathrm{Score}_{DTE}(p) + (1 - \lambda)\, \mathrm{Score}_{DTW}(p) \tag{12} \]
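
The linear combination and the resulting ranking of candidate phrases can be sketched as below. The function names and the per-phrase score pairs are hypothetical values chosen only to show how the weight changes the ranking; they are not results from the experiments.

```python
from typing import Dict, List, Tuple


def score_comb(dte: float, dtw: float, lam: float) -> float:
    """Linear interpolation of the two scores (Eq. 12)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return lam * dte + (1.0 - lam) * dtw


def rank_phrases(scores: Dict[str, Tuple[float, float]], lam: float) -> List[str]:
    """Sort candidate phrases by combined score, most idiomatic first."""
    return sorted(scores, key=lambda p: score_comb(scores[p][0], scores[p][1], lam), reverse=True)


# Hypothetical (DTE, DTW) score pairs for three candidate phrases.
candidates = {
    "phrase a": (0.70, 0.20),
    "phrase b": (0.40, 0.90),
    "phrase c": (0.30, 0.10),
}
ranking_dte_only = rank_phrases(candidates, lam=1.0)
ranking_balanced = rank_phrases(candidates, lam=0.5)
```

Setting the weight to 1 ranks by DTE alone and setting it to 0 ranks by DTW alone, so a sweep over the weight covers both single-function settings as end points.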


3  Experiments 

3.1  Setup 

We have experimented with an English-Korean parallel corpus to acquire English
idiomatic expressions. The corpus, which includes about half a million sentence
pairs, was collected from English-Korean bilingual news websites.² Table 1 shows
statistics of the collection.

We automatically aligned source words with target words in the corpus using the GIZA++
toolkit [15]. We symmetrized the bidirectional results of word
alignments using three types of heuristics: intersection, union, and grow-diag-final.
All experimental results in this section are derived from grow-diag-final
because it reaches the best performance.
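
The two simplest symmetrization heuristics can be sketched as set operations on the two directional alignments. This sketch covers only intersection and union; grow-diag-final, which starts from the intersection and adds selected links from the union, is not implemented here. The function name and the link sets are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]


def symmetrize(src2tgt: Set[Link], tgt2src: Set[Link]) -> Tuple[Set[Link], Set[Link]]:
    """Return (intersection, union) of two directional word alignments.

    src2tgt holds (source index, target index) links; tgt2src holds
    (target index, source index) links and is flipped before combining.
    """
    forward = set(src2tgt)
    backward = {(i, j) for (j, i) in tgt2src}
    return forward & backward, forward | backward


# Hypothetical directional alignments for a four-word source sentence.
src2tgt = {(0, 0), (1, 2), (3, 1)}
tgt2src = {(0, 0), (2, 1), (1, 2)}  # (target index, source index)
intersection, union = symmetrize(src2tgt, tgt2src)
```

The intersection keeps only links found in both directions and so favours precision, while the union keeps every link and favours recall.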

Next, we extracted phrase pairs from the word-aligned corpus using the phrase
extraction algorithm proposed by Och and Ney [11] and estimated the translation
probability of every unique phrase pair by calculating the relative frequency of
phrase pairs. The translation probabilities are used to calculate the phrase
translational entropy and to find default phrase translations of phrases. We extracted in advance the
N-best phrase alignments with the highest translation probability for each phrase in the
corpus to use as default phrase translations. N is set to 2 experimentally.
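
Selecting the N-best default translations from relative frequencies can be sketched as follows. The function name, the alphabetical tie-breaking rule, and the counts are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple


def n_best_translations(counts: Dict[str, int], n: int = 2) -> List[Tuple[str, float]]:
    """Return the n translations with the highest relative frequency.

    Ties are broken alphabetically (an assumption, for determinism).
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(t, c / total) for t, c in ranked[:n]]


# Hypothetical phrase-pair counts for one source phrase.
counts = {"po-gi-ha-da": 6, "geu-man-du-da": 3, "nae-ju-da": 1}
defaults = n_best_translations(counts, n=2)
```

With N set to 2, only the two most probable translations are kept as default translations, and rarer alternatives are ignored by the DTW score.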

Table 1. Corpus Statistics

                     English       Korean
Training Sentences          493,000
Words/Morphemes      10,857,008    12,808,977

² It is a part of the resources from an ongoing project sponsored by SK-telecom, Korea.

It is necessary to construct a set of test phrases to evaluate the proposed
method. Also, each test phrase should have the gold annotation that indicates
whether it is an idiomatic expression or not. The candidate phrases for the
evaluation may be collected using various heuristics or linguistic constraints.
For example, VP-PP tuples were used as test phrases in previous work [3]. Our
evaluation focuses on the scoring function for identifying idiomatic expressions
in a set of candidate phrases rather than on the extraction of candidate phrases.
For this reason, we simply extracted candidate phrases using the phrase extraction
algorithm and several constraints. Every candidate phrase occurs three or
more times in the first 200,000 sentences and involves two or more content words.
We sampled 300 phrases from all candidate phrases, and then two annotators
manually annotated all idiomatic expressions in the phrase set. Among them, 55
phrases were annotated as idiomatic expressions by both annotators. The
inter-annotator agreement for these annotations was measured at an agreement
rate of 0.863 and a kappa value of 0.638.
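
The two agreement figures can be computed as in the sketch below. The text does not state which kappa statistic was used; this sketch assumes Cohen's kappa for two annotators, and the label sequences are hypothetical.

```python
from typing import List, Tuple


def agreement_and_kappa(a: List[int], b: List[int]) -> Tuple[float, float]:
    """Observed agreement rate and Cohen's kappa for two annotators."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    labels = set(a) | set(b)
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical annotations: 1 = idiomatic, 0 = not idiomatic.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
rate, kappa = agreement_and_kappa(annotator_1, annotator_2)
```

Because most candidate phrases are not idiomatic, chance agreement is high, which is why kappa is noticeably lower than the raw agreement rate.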

We used average precision to evaluate the ranked result. The evaluation measure,
which is frequently used in the information retrieval field, emphasizes ranking
relevant items higher:

\[ \mathrm{AveP} = \frac{\sum_{r=1}^{N} \bigl( P(r) \times \mathrm{rel}(r) \bigr)}{\text{number of relevant items}} \tag{13} \]

where  r  is  the  rank,  N  is  the  number  of  retrieved  items,  rel()  is  an  indicator 
function  on  the  relevance  of  a  given  rank,  and  P(r)  is  precision  computed  at 
the  point  of  the  rank.  In  our  case,  candidate  phrases  and  idiomatic  expressions 
correspond  to  items  and  relevant  items,  respectively. 
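
Average precision and precision at a cutoff can be sketched as follows. The sketch assumes that all candidate phrases are ranked, so the number of relevant items equals the number of relevant entries in the ranked list; the function names and the example ranking are hypothetical.

```python
from typing import List


def average_precision(ranked_relevance: List[int]) -> float:
    """Average of P(r) over the ranks r that hold a relevant item."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    accumulated = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            accumulated += hits / rank  # precision at this rank
    return accumulated / total_relevant


def precision_at_k(ranked_relevance: List[int], k: int) -> float:
    """Fraction of relevant items among the top k ranked items."""
    return sum(ranked_relevance[:k]) / k


# Hypothetical ranking: 1 marks an idiomatic expression, 0 a literal phrase.
ranking = [1, 0, 1, 0]
avep = average_precision(ranking)
p_at_2 = precision_at_k(ranking, 2)
```

For this ranking the relevant items sit at ranks 1 and 3, so the average precision is the mean of 1/1 and 2/3.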

3.2  Experimental  Results 

Baseline. We implemented translational entropy (TE) and proportion of default
alignment (PDA), proposed by Melamed [9] and Moiron [3] respectively, as
baselines to compare with our proposed method.

Table 2 shows the performances of the identification of English idiomatic
expressions using TE and PDA. Fig. 2 shows the combination performances
of the two approaches according to the weight lambda. The best performance was
obtained when the weight was set to 0.9. We use these figures as a baseline for
our study.

Table 2. Performances with TE and PDA

Alignment Type              Scoring Function   AveP    P@20    P@30    P@55
Word Alignment (baseline)   TE (λ = 1)         0.312   0.450   0.333   0.291
                            PDA (λ = 0)        0.244   0.250   0.267   0.291

[Figure: AveP (y-axis) plotted against λ from 0 to 1 (x-axis)]

Fig. 2. Average Precision of TE + PDA according to lambda

Table 3. Effect of DTE and DTW

Alignment Type                       Scoring Function    AveP    P@20    P@30    P@55
Word Alignment (baseline)            TE+PDA (λ = 0.9)    0.323   0.350   0.333   0.273
Phrase Alignment (proposed method)   DTE (λ = 1)         0.341   0.400   0.333   0.364
                                     DTW (λ = 0)         0.440   0.650   0.600   0.491
                                     DTE+DTW (λ = 0.5)   0.508   0.650   0.633   0.473


Effect of DTE and DTW. Table 3 shows the performance of English idiomatic
expression identification in an English-Korean parallel corpus. The first
row is the baseline and the following three rows are the results of our proposed
scoring functions. Both proposed functions, DTE and DTW, achieved better
performance than the baseline. This result shows that examining phrase alignments
produces positive effects and that the proposed functions improve the performance
of idiomatic expression identification.

DTE is a phrase-level extension of TE and DTW is a phrase-level extension of
PDA. In terms of these extensions, we found that DTE and DTW are more
effective in idiomatic expression identification than TE and PDA, respectively,
and that DTW brought about a larger improvement than DTE. This shows that
the use of default phrase translations, which was not considered in the baseline
approaches, is very useful.

However, the reason why DTE produces disappointing performance is found
in the phrase translational entropy calculation step. The set of translated phrases
of a source phrase contains many target phrases that share the same topic but use different expressions.
Such different target phrases with the same meaning propagate
many errors into the entropy of each phrase. We expect to minimize these errors
by clustering target phrases aligned with the source phrase in the future.

Fig. 3. Average Precision of DTE + DTW according to lambda

Fig. 4. Recall-Precision Graph of Previous and Proposed Scoring Functions

We also observed, by combining the two functions, that DTW is complementary to DTE.
The last row of Table 3 shows the effect of this combination. This
is because DTE identifies idiomatic proper nouns such as “new york” or “korea
university” more accurately than DTW, while DTW recognizes idiomatic verb
phrases or prepositional phrases better than DTE. Fig. 3 shows the average
precision of our proposed method according to the parameter lambda. The best
performance was obtained at 0.5.

Fig. 4 is the recall-precision graph of the baseline and the proposed method.
The x-axis and the y-axis indicate the recall and the precision, respectively. The
method using phrase alignments has higher precision at all recall levels than
the method using word alignments. In addition, we found that there is a large gap
in precision between the two approaches at recall levels of 0.2-0.4, while they are
similarly effective at recall levels of 0-0.1.
Table 4. Effect of Linguistic Constraints

Alignment Type                       Scoring Function     AveP    P@20    P@30    P@55
Phrase Alignment (proposed method)   DTE+DTW              0.508   0.650   0.633   0.473
                                     DTE+DTW+Constraint   0.519   0.700   0.633   0.509

Further Improvement with Linguistic Constraint. So far, we have presented
a method that is independent of any language pair. Now we show that some
linguistic constraints can be integrated into the method to improve the
performance of idiomatic expression identification. In this experiment, we simply
added two rules to the scoring process as follows.

- Rule 1: Exclude English articles such as “a” or “the” from averaging the
  translational entropy values of individual words in a phrase in DTE.

- Rule 2: Exclude Korean functional words such as postpositions and endings,
  e.g. “을 (eul)”, “으로 (eu-ro)”, or “에서 (e-seo)”, from $W_p$ in DTW.

We expect that Rule 1 will be effective for our task because English articles are
usually not translated to any Korean words in English-Korean translation. Rule
2 rests on the assumption that the non-compositionality of words does not depend
on the difference of functional words in the translated phrases of the source words.
The figures in Table 4 suggest that these techniques are valuable for our approach.

4  Conclusion  and  Future  Work 

This paper proposed a method for identifying idiomatic expressions using phrase
alignments instead of word alignments in a bilingual parallel corpus. In this work,
we focused on overcoming the limitations of previous approaches and quantifying
the difference of the translation tendency between a phrase and the individual words
in the phrase. We proposed two scoring functions in which such differences are
reflected. The experimental results showed that our proposed scoring functions
were effective in idiomatic expression identification. Moreover, we showed
that linguistic constraints can be integrated into our method to improve the
performance.

For future work, we first intend to explore the method using not only
English-Korean but also English-French or English-Chinese parallel corpora
together, in order to identify English idiomatic expressions. Secondly, we plan to
identify Korean idiomatic expressions by changing only the translation direction
from English-Korean to Korean-English. Also, we intend to improve the quality
or the efficiency of machine translation systems with the idiomatic expressions
identified by our approach.
Abstract. Partial deduction is an optimisation technique developed by
the logic programming community. We propose the use of partial
deduction in the domain of wireless sensor network programming, where
programs are written for small computational platforms and energy is
typically scarce. We show how, together with a declarative programming
language which has been shown to be suitable for several demanding
sensor network applications, it can address key issues such as rewriting a
query using views and reducing redundancy of rewritings, as long as some
computation and abstraction can be performed at compile-time, which
leads to improved energy efficiency at run-time. We
argue that energy efficiency can be achieved with: (1) minimised sensor
network programming workload by forcing the folding of goals into the
view partially; (2) reduced redundant computation with fewer computation
steps at network nodes by forcing the unfolding of simple goals; (3)
reduced inter-node message transmission by more specific addressing of
messages to nodes; and (4) reduced memory requirements by specialising
network-wide programs to smaller programs for specific nodes. A partial
deduction system is developed and an extended example is provided to
demonstrate the potential performance improvement of the technique.


1  Introduction 

Wireless sensor networks (WSNs) promise to revolutionise sensing in a wide
range of application domains. They offer the potential to advance
scientific pursuits in areas such as manufacturing, agriculture, and transport [1].
However, wide acceptance and deployment has not yet occurred because of a lack
of robust platforms and a lack of fully functional support for data manipulation.
From a technical point of view, one may think of a sensor network as a database
that is able to conduct query processing over a large range of heterogeneous
data distributed arbitrarily. Query processing in a sensor network, however,
works differently from that in a traditional DBMS, because changes
to the data collection regime may happen unpredictably as
sensors come and go, in addition to imperfect link quality. Furthermore, the
sensed events and sensing intervals may vary dramatically on different occasions,
and the volume of sensed data can vary widely depending on sampling rate
variations. Traditional database optimisation techniques of specifying join methods
and indices are still useful, but the unique characteristics of sensor networks
should be considered, as a query in sensor networks may be based on live
data, archived data, or a mix of both. As for archived data, we are
interested in using a set of views¹ V expressed in terms of archived data sources
to associate with previous query results. Roughly, the following steps are needed
to process the query if the end-user would like to “find sensors (i.e. locations,
sensor IDs) where the temperature measurements are within a specific range X
and their residual power is at least Y units, and send the new temperature to its
available neighbours”:

1. decide if the query can be fully answered by using views V

2. if not, (using as many views as possible) develop a sensor network program
(in a logic programming language) with respect to the query

3. (applying a dedicated optimisation technique) generate an efficient sensor
network program from step 2 to cope with severe resource and bandwidth
constraints on the sensor nodes

Usually, a sensor network query will ask for live sensor readings. A solution to
the last two steps is therefore necessary, and this will be the main focus of
this paper. Specifically, for a query expressed in a logic programming language,
we look into rewriting the query using views first and then specialising the
resulting network-wide program into a smaller program for the specific nodes
rather than all nodes. We propose using partial deduction to achieve this goal.
To meet our requirement, the design of fold/unfold control will be considered.
In particular, we are interested in problems for which either part of their
definitions (e.g. the code to solve them) is available or bindings of variables
can be computed at compile-time, as this sort of problem benefits considerably
from partial deduction. This paper makes the following contributions:

- using views to rewrite a query for sensor network query processing is
discussed;

- fold/unfold control to generate a compact new program is investigated;

- a generic partial deduction system to generate a smaller program is
developed;

- the cost analysis to show the significant difference made by applying partial
deduction is given.

In order to ensure that partial deduction makes good use of the logic structure
of a problem and other data sources, we essentially need a language expressive
enough to describe a broad range of problems but restrictive enough to allow
efficient algorithms to operate over it. In fact, as pointed out in [2,3,4,5,6],
it is natural to choose a declarative language to describe problems (e.g.
queries) as it offers an easy-to-understand programming interface. Moreover, it
opens up the possibility for optimisation algorithms to handle efficient access
strategies transparently to the user. As a result, we will use a logic
programming language (e.g. ℒ) throughout the paper. From a programming
perspective, we will not differentiate wireless sensor network (WSN) programming
from sensor network (SN) programming in this paper.

¹ Views are simply results from previous queries. Datalog-style notation is used
throughout the paper for views and queries.

The rest of the paper is organised as follows. Section 2 introduces the
important definitions and background. Section 3 discusses partial deduction to
generate an efficient sensor network program. Section 4 details the proposed
optimisation technique with an extended example followed by the cost analysis.
Section 5 briefly reviews the related work. Section 6 presents the conclusion and
future work.

2  Preliminary 

In this paper, query processing aims to generate an efficient sensor network
program with respect to a specific query. Informally, suppose we have:

- a query Q expressed in the language ℒ

- a set of views V expressed in terms of an archived data source S, also in ℒ

- a generic sensor program in the same language

and we want to generate a new program (i.e. NewPgm in Fig. 1) with respect to
the original one (i.e. Pgm). The development of the partial deduction system is
then critical to the success of query processing.


Fig.  1.  Generate  an  efficient  program 

The following definitions help to understand partial deduction. Refer
to the logic programming literature [7,8] for more detailed definitions.

Definition (clause). A clause is a disjunction of literals. In first-order logic,
a clause is the universal quantification of all free variables of a quantifier-free
disjunction of literals. Formally, a first-order literal is a formula of the kind
$P(t_1, \ldots, t_n)$ or $\neg P(t_1, \ldots, t_n)$, where $P$ is a predicate of
arity $n$ and each $t_i$ is an arbitrary term. A clause is usually written as the
implication of a head from a body. In this paper, we consider clauses with at
most one positive literal.

Definition (conjunctive query). A conjunctive query has the form
$H(X) \text{ :- } B_1(X_1), \ldots, B_m(X_m)$, where $H(X)$ is the head, each
$B_i(X_i)$ is a sub-goal in the body, and each tuple $X_i$ contains either
variables or constants. All queries are required to be safe, i.e.,
$X \subseteq X_1 \cup \ldots \cup X_m$.

Definition (views). A set of view definitions (e.g. clauses) have the same form
(i.e. represented by a head and a body) but are expressed in terms of a set of
database relations. For example, we have the view $v_1$ in the form
v1(Src) :- residualPower(@Src, Y), Y > 1000. It means that $v_1$ stores all
sensors (Ids) with residual power greater than 1000 units.
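
The safety condition and the view $v_1$ can be illustrated with a minimal Python sketch. It assumes a hypothetical encoding that is not part of the paper: atoms are (predicate, arguments) pairs, variables are capitalised strings, and the residualPower facts are invented purely for illustration.

```python
def is_var(term):
    """Treat capitalised strings as logic variables (an encoding assumed here)."""
    return isinstance(term, str) and term[:1].isupper()

def is_safe(head_args, body_atoms):
    """A query is safe if every head variable also occurs in some body atom."""
    body_vars = {a for _, args in body_atoms for a in args if is_var(a)}
    return all(a in body_vars for a in head_args if is_var(a))

def view_v1(residual_power, threshold=1000):
    """v1(Src) :- residualPower(@Src, Y), Y > threshold."""
    return sorted(src for src, y in residual_power.items() if y > threshold)

# Hypothetical archived facts residualPower(@Src, Y): node id -> residual power.
facts = {11: 1500, 12: 2300, 13: 800, 15: 1001}

print(is_safe(("Src",), [("residualPower", ("Src", "Y"))]))  # True
print(view_v1(facts))  # [11, 12, 15]
```

A head variable that never appears in the body, such as an extra Z in v1(Src, Z), would make the query unsafe because nothing could bind it.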

Definition  (program).  A  program  is  a  finite  set  of  definite  clauses. 

Definition  (unifier).  A  unifier  of  two  terms  is  a  substitution  making  the  terms 
identical.  If  two  terms  have  a  unifier,  they  are  said  to  unify.  Further  explanation 
is  given  subsequently. 

Definition (unification). Unification is performed between the predicates and
the atoms or terms in a program. If a unification succeeds, that is, the predicate
names, the arity (i.e. the number of arguments), and the arguments are the same,
the variables will be instantiated (i.e. bound).

Definition (unfolding). Substituting a goal in the body of a clause by the
corresponding body. For example, unfolding a sub-goal $B_i$ in a clause
$H \leftarrow B_1, \ldots, B_n$ with respect to a clause
$B \leftarrow C_1, \ldots, C_m$, where $B$ and $B_i$ unify with $\theta$, produces
the clause
$(H \leftarrow B_1, \ldots, B_{i-1}, C_1, \ldots, C_m, B_{i+1}, \ldots, B_n)\theta$.
Unfolding propagates bindings. In this paper, unfolding is also called
unification-based propagation.
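
A single unfolding step can be sketched in Python as follows. This is only an illustration under stated assumptions: atoms are (predicate, argument-tuple) pairs with flat arguments (no nested terms), variables are capitalised strings, and the two clauses share no variable names. All helper names are hypothetical and not taken from the paper.

```python
def is_var(t):
    """Capitalised strings stand for logic variables in this sketch."""
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    """Follow variable bindings until a constant or an unbound variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b):
    """Unify two flat atoms; return a substitution dict, or None on failure."""
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        return None  # different predicate name or arity
    subst = {}
    for x, y in zip(a[1], b[1]):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None  # two different constants
    return subst

def substitute(atom, subst):
    """Apply a substitution to every argument of an atom."""
    pred, args = atom
    return (pred, tuple(walk(t, subst) for t in args))

def unfold(clause, i, definition):
    """Replace the i-th body atom of `clause` by the body of `definition`."""
    head, body = clause
    d_head, d_body = definition
    theta = unify(body[i], d_head)
    if theta is None:
        return None
    new_body = body[:i] + d_body + body[i + 1:]
    return (substitute(head, theta), [substitute(b, theta) for b in new_body])

# h(X) :- b(X, Y), c(Y).   unfolded with   b(U, k) :- d(U).
clause = (("h", ("X",)), [("b", ("X", "Y")), ("c", ("Y",))])
definition = (("b", ("U", "k")), [("d", ("U",))])
print(unfold(clause, 0, definition))
```

In this example the binding Y = k is propagated into c(Y), which is the unification-based propagation referred to above.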

Definition  (folding).  The  inverse  of  unfolding,  whereby  an  instance  of  a  pred¬ 
icate  is  substituted  by  the  corresponding  call.  More  discussion  is  available  in 
Section  1. 

Definition (partial deduction). A system for controlled folding/unfolding is
known as partial deduction. It is often used for specialising a program with
respect to incomplete input.

We are developing solutions to handle query processing in sensor networks.
Early work on query rewriting algorithms [9] and on the design of a semantic
sensor network service framework [10] has been reported, and both are integral
parts of query processing. However, we will focus on using partial deduction to
optimise a sensor network program in this paper.

3  Partial  Deduction  and  Its  Impact 

Partial deduction is especially useful for removing levels of interpretation [11]
to generate a specialised program that generally runs far more efficiently than
the generic program. This specialised program is consciously tailored to a
particular task. The theoretical underpinnings of this approach were discussed
in [12,8]. The main idea of partial deduction is to recursively perform
fold/unfold until no more progress can be achieved [11]. A few things must be
included to build such a system. Generally we consider:

1. The residual program, which is equivalent to the original one, should be
kept.

2. Clauses must be handled as well as goals. Folding the head and unfolding
the body are highly desired.

3. A few declarations must be made explicitly, for example, which
goal(s)/sub-goal(s) should be folded, unfolded or left alone.

(a) If folding is required, to what they should be folded (to make use of as
many views as possible). Usually, this requires unfolding first (bindings
propagation) to allow more specific folding.

(b) Unfolding rules are required to control unfolding, which is the inverse of
folding. In addition to the general meta-interpreter discussed in [11] and the
unfold criteria in [13], it is preferable to consider the unique characteristics
of a sensor network while defining folding/unfolding rules; for example, we need
to consider the built-in predicates of the sensor network programming language.

(c) The empty goal and true will be handled in the obvious way after both (a)
and (b) have been performed recursively.

After such a system has been developed, it is expected that NewPgm is smaller
than the original (also generic) sensor program (i.e. Pgm). In addition, it has
the potential to reduce the size of messages and the (total) data transmission as
well. These effects are illustrated by the following examples.

Example 1. Suppose there are two sensor nodes in the network, @1 and @2,
respectively. The notation "@" in @Id means that the tuple is hosted at node Id.
For example, one single fact q(@2, c, 1) is hosted at the node with Id = 2. There
is another rule hosted at node 1: p(@1, X, Y) :- q(@2, X, Y), Y =\= 1. Fig. 2
shows the differences between the two cases (1) and (2).

(1) without partial deduction (Fig. 2(a)): at least two variables, X and Y, have
to be sent from node 2 to node 1 to be evaluated at node 1.

(2) after partial deduction (Fig. 2(b)): as the fact at node 2 instantiates the
variable Y to 1, the second rule (the one at node 1) will not be fired. As a
result, no variable needs to be transmitted, unlike in Fig. 2(a).
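
The effect described in Example 1 can be mimicked with a small Python sketch. It assumes that the facts for q/3 are known at compile time and are encoded as (node, X, Y) tuples; the function name is hypothetical and the code is not the paper's partial deduction system.

```python
def specialise_p(q_facts):
    """Specialise  p(@1,X,Y) :- q(@2,X,Y), Y =\= 1  against known q facts.

    Returns the residual p tuples that node 1 would still have to derive.
    An empty result means the rule never fires and nothing is transmitted.
    """
    residual = []
    for node, x, y in q_facts:
        # The guard (Y not equal to 1) is evaluated at compile time.
        if node == 2 and y != 1:
            residual.append((1, x, y))
    return residual

print(specialise_p([(2, "c", 1)]))  # []
print(specialise_p([(2, "c", 5)]))  # [(1, 'c', 5)]
```

With the single fact q(@2, c, 1) the residual is empty, matching case (2): the guard already fails during specialisation, so X and Y never need to travel from node 2 to node 1.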


Fig. 2. Impact of partial deduction (1)

Fig. 3. Impact of partial deduction (2)

Example 2. Suppose there are eight nodes distributed as shown in Fig. 3(a),
and the original program is about 10K bytes in size. In Fig. 3(a), the total
data transmission is about 70K bytes. Suppose that, after applying partial
deduction, the program is specialised to 1K bytes for each of the nodes. Then
the total data transmission is about 13K bytes, as shown in Fig. 3(b), which is
obviously less than 70K bytes. Apart from reducing data transmission, partial
deduction also plays an important role in code decomposition, as a particular
piece of code is generated for each node by taking account of some computation
and abstraction at that node.

4 Query Processing and Cost Analysis

In this section, we will first look into an extended sensor network example to
gain insight into partial deduction. We then present the cost analysis, which
indicates that partial deduction has the potential to improve query processing
in sensor networks. Before we describe partial deduction in more detail, we give
a brief description of the generic sensor network program used throughout the
paper.


4.1  Running  Example 

The original program (Pgm):

result(@Next,Src,Val,NewCost) :-                          % ...(1)
    message(@Src,Next,Dest,Val),
    timer(@Src,3,TimePeriod),
    candidate(@Src,Dest,Next,NewCost,NewHops).
timer(@Src,TimerID,TimePeriod) :-                         % ...(2)
    timerx(@Src,TimerID,TimePeriod).
message(@Src,Next,Dest,Val) :-                            % ...(3)
    sensorId(@Src,Id),
    sensorMeasure(@Src,Id,Val1,Val2),
    range(ReqVal1,ReqVal2),
    Val1=<ReqVal1, ReqVal2=<Val2,
    reading(@Src,Id,H:M:S,Val),
    residualPower(@Src,Z), Z>1000.
sensorId(@Src,Id) :-                                      % ...(4)
    sensor(Src),
    Id = Src.
reading(@Src,Id,H:M:S,Tval) :-                            % ...(5)
    sampling(@Src,TimePeriod,TimePeriod,H:M:S,Tval).
candidate(@Src,Dest,Next,NewCost,NewHops) :-              % ...(6)
    beacon(@Src,NewNext,Dest,OldNext,OldCost,OldHops),
    nextHop(@Src,Dest,_,_,_),
    linkLqi(@Src,NewNext,LinkCost),
    OldNext =\= Src,
    NewCost = OldCost+LinkCost,
    NewHops = OldHops+1,
    dest(@Src,Dest),
    Src =\= Dest.


Following is a brief explanation of the predicates in the clauses (1)-(6).

- the predicate result/4 can be interpreted as: if there exist a message (i.e.
message/4) and a candidate (i.e. candidate/5) whose next hop is Next, and if
the timer fires, then send the message to Next.

- the predicate message/4 is further defined to symbolise the message by
describing where the tuple will be sent to (i.e. Next), the origin of the
message (i.e. Src), and the destination (i.e. Dest). The variable Val is the
content of the message, namely the current temperature reading. However, the
sensing (i.e. reading/4) will not take place until it is certain that this
sensor has the capability to obtain it (that is, within the required range) and
the node residual power is greater than 1000 units.

- the predicate sensorId/2 is defined to link Src and Id together, with Id
specifying which sensor is currently concerned.

- the predicate residualPower/2 measures the residual power at the node.

- the predicate reading/4 is defined to present the sensing data with a given
sampling rate within a sampling period.

- the predicate sampling/5 is defined as a built-in predicate. For brevity,
the sampling rate and sampling period are set to the same value in this program.

- the predicate candidate/5 can be understood as a candidate to receive a new
beacon message, with NewCost and NewHops to be updated accordingly.

- the predicate nextHop/5 is defined as another built-in predicate to indicate
the next hop that the message should head for.

- the predicate linkLqi/3 is also defined as a built-in predicate to represent
the last received packet from the source (i.e. @Src).

We will use the view $v_1$ introduced in Section 2 to rewrite part of the clause (3).
We are aware that different algorithms [9,14] for query rewriting exist. In this
paper, we use the equivalent rewriting algorithm for demonstration.

Now let us revisit the sample query introduced in Section 1. The query is to
"find sensors where the temperature measurements are within a specific range
X and their residual power is at least Y units, and send the new temperature to
its available neighbours". The predicate find/4 is defined to represent the goal
at an abstract level. The developed partial deduction system consists of two
parts: the partial reducer and the fold/unfold control. The following code
snippet gives a brief idea of what a partial reducer looks like.

partial reducer:

do_fold(H1,H2) :-
    fold(H1,H2), !.
do_unfold((H:-B),(H:-NB)) :-
    conjunct_to_list(B,BL),
    unfold(BL,NBL),
    list_to_conjunct(NBL,NB).


Note that, as unfolding first allows more specific folding [11], we have to
consider the order in which the fold/unfold rules are fired. In the example, all
sub-goal(s) in the original program Pgm should be unfolded, while result/4 should
be folded into find/4. Following is a fragment of the fold/unfold control in our
example.

fold/unfold control:

% variable 'Clauses' are clauses from Pgm
program(Pgm,Clauses).

% unfold all sub-goals from Pgm
unfold(message(@Src,Next,Dest,Val),Nm).

% fold result/4 to find/4
fold(result(@Next,Node1,Val,NewCost),
     find(@Src,Node1,Val,NewCost)).


Putting the preceding partial reducer and fold/unfold control together, Pgm
will be specialised into the following new program NewPgm, given the view $v_1$.
Note that Node1 is 12, and Node2 can be either 11 or 15; we illustrate
Node1 = 12, Node2 = 11 only.

NewPgm (the variables have been renamed by the system):

find(@12,12,11,_G1413) :-
    sampling(@12,_G1422,_G1422,_G1429:_G1432:_G1433,_G1410),
    nextHop(@12,0,_G1440,_G1441,_G1442),
    linkLqi(@12,_G1447,_G1416).


This new program entirely replaces the original one given earlier. It is specialised
from clauses (1)-(6). Clearly, the specialised code is more compact. This is
because, among the available nodes, only nodes {12} and {11, 15} meet the
requirements, given the view $v_1$. Other advantages will be discussed in the
subsequent section.

4.2  Cost  Analysis 

For the cost analysis, we first analyse the cost matrix in our example. Since our
focus is on query processing, only query-related cost will be taken into account
in this paper. We will consider data acquisition and transmission in future
work. Thus, the estimated cost is defined as the combination of:

- ($c_{node}$) the rule evaluation cost at each node

- ($c_{trans}$) the number of variables transmitted between two nodes

The following notations are used to simplify the analysis:

$r$ - the number of rules/clauses in a program;

$l_i$ - the number of predicates in rule $r_i$;

$d_j$ - the number of variables transmitted between two nodes when one predicate
is concerned (see Fig. 2(a) for an example).

We assume a $\sqrt{n}$ by $\sqrt{n}$ square grid topology for the analysis. The
basic idea is that we are able to count the number of transmissions with
$\sqrt{n} - 1$ hops at most (e.g. diagonal routing), which is the longest path
from one end to another. We also assume that there are $m$ sensors in the
network. In our cost model, the total number of predicates is defined as:

$p = \sum_{i=1}^{r} l_i$ ...(f1)

With these notations, the rule evaluation cost $c_{node}$ (listed as c_cost in
Table 1) is given below:

$c_{node} = m \times \sum_{i=1}^{r} l_i$ ...(f2)

and the cost of variable transmission is defined as:

$c_{trans} = m \times \sum_{j=1}^{p} d_j \times (\sqrt{n} - 1)$ ...(f3)

Substituting $p$ in formula (f3) by formula (f1), the total cost would be:

$c_{total} = c_{node} + c_{trans} = m \times \left( \sum_{i=1}^{r} l_i + (\sqrt{n} - 1) \times \sum_{j=1}^{\sum_{i=1}^{r} l_i} d_j \right)$ ...(f4)
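
The cost model can be written as a short Python function. This is a sketch under the assumption that the factor m applies to both cost terms, as the entries of Table 1 suggest; the function and parameter names are illustrative only. The example call uses the NewPgm figures: one rule with four predicates (head included), 11 distinct variables in total, and m = 2 sensors.

```python
import math

def total_cost(preds_per_rule, vars_per_pred, m, n):
    """c_total = m * (sum(l_i) + (sqrt(n) - 1) * sum(d_j)), following (f1)-(f4).

    preds_per_rule: l_i, the number of predicates in each rule
    vars_per_pred:  d_j, the number of variables per predicate
    m:              number of sensors involved
    n:              number of grid positions (sqrt(n) by sqrt(n) topology)
    """
    p = sum(preds_per_rule)                                 # (f1)
    if len(vars_per_pred) != p:
        raise ValueError("need one variable count per predicate")
    c_node = m * p                                          # (f2)
    c_trans = m * sum(vars_per_pred) * (math.sqrt(n) - 1)   # (f3)
    return c_node + c_trans                                 # (f4)

# NewPgm on an assumed 4 x 4 grid: find/4, sampling/5, nextHop/5, linkLqi/3
# contain 1, 5, 3 and 2 distinct variables respectively.
print(total_cost([4], [1, 5, 3, 2], m=2, n=16))  # 74.0
```

For n = 16 this agrees with the NewPgm column of Table 1: 22 x (4 - 1) + 8 = 74.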


The influence of partial deduction on $c_{total}$ is obvious when the number of
rules to be fired (i.e. $r$ in formula (f4)) is reduced. Moreover, in most cases,
instead of all nodes being involved in query processing, only the relevant nodes
specified by the specialised program will be active. As such, the number of
involved sensors, $m$, decreases tremendously. Consequently, $c_{total}$ in
formula (f4) will be reduced accordingly.

The detailed cost analysis based on the preceding example is given in Table 1.
For simplicity, all predicates are treated equally in the table. The minimum
number of variables in a predicate, say, 2, is used for the analysis. Based on the
analysis shown in Table 1, it is clear that a significant difference in cost
exists between Pgm and NewPgm. We employ a metric, called Diff, to quantify the
cost savings in query processing as a result of partial deduction. Diff is
computed as the difference between the total cost of Pgm and that of NewPgm. An
estimation is given as follows.

$Diff_{total} = c_{total}(Pgm) - c_{total}(NewPgm)$

$= 27 \times m + 54 \times m \times (\sqrt{n} - 1) - 22 \times (\sqrt{n} - 1) - 8$

$= 27 \times m + 43 \times m \times (\sqrt{n} - 1) + 11 \times m \times (\sqrt{n} - 1) - 22 \times (\sqrt{n} - 1) - 8$ ...(f5)

Note that $m \gg 2$ in this example. Thus, the third and fourth terms of (f5) can
be removed if we simply let $m = 2$ in the term
"$11 \times m \times (\sqrt{n} - 1)$", since that term then equals
$22 \times (\sqrt{n} - 1)$ and cancels the fourth term. We then have

$Diff_{total} > 27 \times m + 43 \times m \times (\sqrt{n} - 1) - 8$

$= 26 \times m + 43 \times m \times (\sqrt{n} - 1) + m - 8$

$> 26 \times m + 26 \times m \times (\sqrt{n} - 1)$

$= 26 \times m \times \sqrt{n}$

$> m \times \sqrt{n}$

Thus, the "order" of the calculation in Big O notation is $O(m\sqrt{n})$
($m \gg 2$, $\sqrt{n} > 1$).
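
The bound can be checked numerically with a few lines of Python. The closed forms below are taken from Table 1; the sample values of m and n are arbitrary choices for illustration, not measurements from the paper.

```python
import math

def diff_total(m, n):
    """Diff_total = c_total(Pgm) - c_total(NewPgm), using the Table 1 entries."""
    s = math.sqrt(n) - 1
    cost_pgm = 27 * m + 54 * m * s
    cost_newpgm = 22 * s + 8
    return cost_pgm - cost_newpgm

# Sample (m, n) pairs with m > 2 and sqrt(n) > 1.
for m, n in [(8, 4), (16, 16), (100, 64)]:
    saving = diff_total(m, n)
    lower_bound = m * math.sqrt(n)
    print(m, n, saving, saving > lower_bound)
```

For every sampled pair the saving exceeds m x sqrt(n), consistent with the derivation above.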


Table 1. Cost analysis

Cost criteria      Pgm                                 NewPgm
r                  6                                   1
p                  27                                  4
No. of sensors     m >> 2                              2
c_cost             27 x m                              8
No. of variables   >= 27 x 2 = 54                      11
c_trans            54 x m x (sqrt(n) - 1)              22 x (sqrt(n) - 1)
c_total            27 x m + 54 x m x (sqrt(n) - 1)     22 x (sqrt(n) - 1) + 8

With partial deduction, generally, we have achieved the following:

- the new program is smaller and more compact than the original one

- the storage on nodes has been reduced, as fewer nodes need to consider
it

- inter-node message transmission (i.e. variable transmission) has been
reduced by more specific addressing of messages to nodes.

All these make an improvement in execution performance possible, as the
computation and space complexity have been reduced.


Generating  an  Efficient  Sensor  Network  Program  by  Partial  Deduction 




5  Related  Work 

Following are brief overviews of the related work on sensor network queries. Three
categories are identified below. Let us start with database queries.

5.1  Database  Query 

TinyDB [15,16] (http://telegraph.cs.berkeley.edu/tinydb/), a seminal
first-generation SN database, was developed by UC Berkeley. TinyDB's structure
allows queries to be parsed and optimised at the base station. The optimisation
phase focuses on choosing the correct ordering of sampling, selections, and
joins with the help of metadata [16]. However, little or no work on partial
deduction has been reported, and it is most likely that the non-specialised
binary format of the queries is sent into the sensor network, where they are
instantiated. This contrasts dramatically with our approach, where the
instantiation is performed at the sink and abstraction is performed with
unification-based propagation throughout the program at compile-time, before the
specialised program is distributed to the sensor network.

Cougar [4,17] (http://www.cs.cornell.edu/database/cougar/index.php)
discusses queries over sensor networks by allowing users to task the network
by adding a query layer above the networking layer in the protocol stack [5].
In-network aggregation is the focus of the paper. Again, known knowledge has
not yet been discussed sufficiently. To our knowledge, neither TinyDB nor Cougar
has explicitly taken full advantage of knowledge known a priori in query
processing.

Compton's paper [9] investigated a maximal rewriting using views in the
presence of functional dependencies and value constraints. It will be studied
further in our work.

5.2  Query  Programming  Language  and  Platform 

Efforts from UC Berkeley present the design and implementation of a declarative
sensor network platform (DSN) [2], which includes a declarative language (i.e.
Snlog), a compiler and a runtime supported by TinyOS (http://www.tinyos.net/).
At the core of the platform lies the Snlog compiler, which transforms the
Snlog specification into the nesC language, which is native to TinyOS. The
generated code, together with the relevant compiler libraries, is further
compiled by the nesC compiler into a binary image to be injected into the nodes
in the network. The focus of DSN is on providing a single high-level programming
environment. The authors of [2,18] have addressed traditional sensor network
protocols and demonstrated that DSN is a natural fit for sensor networks.
However, there is no discussion about implementing efficient query processing by
taking advantage of the known resources. We attempt to address this issue to
reduce computation and bandwidth usage and eventually minimise data transmission
for resource-constrained WSNs.

The SNEEql (Sensor NEtwork Engine query language) query optimiser [19]
(http://intranet.cs.man.ac.uk/img/dias-mc/sneeql-overview.php) is a recent
attempt which combines an expressive query language with a layered architecture
to generate executable nesC code. However, the proposed approach does not seem
to consider how to generate an efficient sensor network program.

5.3  SensorWeb 

Other relevant work comes from sensornet (http://www.sensornet.gov/) and
sensorWeb [20], where the knowledge known a priori has been used, either as an
ontology or a repository, to improve query processing at the service level.
Reported work has proposed a service-oriented framework to handle both data
streams from WSNs and information retrieval requirements. These projects have
different views about WSNs and none of them attempted to consider efficient
sensor network program generation in WSNs.

As discussed, we are interested in rewriting a query using views and then
reducing redundancy to generate an efficient sensor network program. We have
demonstrated that partial deduction has the potential to improve application
performance.

6  Conclusion 

We argue that the efficiency of a sensor network program can be improved with
partial deduction by using views. We highlighted the significance of partial
deduction in query processing. We have demonstrated that redundancy can be
reduced considerably as long as some computation and abstraction can be
performed at compile-time.

We are aware of the inherent limitations of partial deduction, but we argue that
a class of problems gains much more from partial deduction, namely those for
which either part of their definitions (the code to solve them) is available or
bindings of variables can be computed at compile-time. We have demonstrated that
for these kinds of problems, the advantages of partial deduction greatly outweigh
its disadvantages. The analysis also suggests that it is promising to apply the
proposed optimisation technique to reduce redundancy to better address tight
energy and bandwidth issues in sensor network applications.

It is generally expected that an automatic sensor network program generator
will be developed to lessen the heavy burden of rewriting an arbitrary query. In
order to achieve this goal, we plan to study sensor network query rewriting in
depth in the future.

Abstract. In this paper, conditional localization and mapping (CLAM) is
realized with a stereo camera as the only sensor. Compared with visual
simultaneous localization and mapping (SLAM), the framework of CLAM is a
newly proposed conditional filter rather than the extended Kalman filter (EKF).
In this algorithm, there is no camera velocity information in the filter state;
the measurements and the state equation all depend on image data, which are the
most reliable information. As a result, CLAM outperforms SLAM when the camera
turns abruptly or some frames are lost, conditions in which SLAM may diverge
quickly because the predefined model is incorrect in such cases. For CLAM, the
model is derived from image data, so CLAM has no such problems. The experimental
results show that the proposed CLAM is robust to abrupt turning of the camera
and to frame loss, and also gives precise 3D information about the features and
the trajectory of the camera.

Keywords:  Stereo  Camera,  CLAM,  conditional  filter. 


1  Introduction 

In the past decade, significant progress has been made in autonomous robot
navigation. SLAM has become more and more popular in robotics as a solution to the
question of a moving sensor platform constructing a map of its environment during its
first navigation while concurrently estimating its position and direction [1, 2, 3].

Early  work  was  done  in  sonar-based  navigation  of  mobile  robots  using  the  Kalman 
filter  algorithm,  as  in  [4]  and  [5].  Although  sonar  signals  are  insensitive  to 
illumination  variance,  they  are  inaccurate.  Compared  with  sonar  signals,  images 
captured  by  camera  are  compact,  accurate,  and  well  understood. 

As for visual map building, Moutarlier and Chatila [6] proposed an approach
taking account of all correlations in general robot localization and mapping problems
within a single state vector and covariance matrix updated by the extended Kalman
filter (EKF). Several early implementations verified the single EKF approach for
building modest-sized maps in real robot systems and convincingly demonstrated the
importance of maintaining estimate correlations. These successes gradually resulted
in very widespread adoption of EKF as the core estimation technique in SLAM. The
most successful visual SLAM using a monocular camera was recently developed by
Davison [7, 8, 9], whose contributions include an active approach to mapping and
measurement, the use of a general motion model for smooth camera movement, and
solutions for monocular feature initialization and feature orientation estimation.
Civera [10, 11, 12] enhanced Davison's work by introducing inverse depth for
feature points, producing measurement equations with a high degree of linearity.
Thomas [13, 14] realized vision-based SLAM using a stereo camera, a monocular
camera and a panoramic camera, respectively. This approach can deal with closing
large loops. All the above approaches built a map of the environment with feature
points and the trajectory of the camera. However, many images must be obtained in
a short time. In addition, the camera should move smoothly because, if the
estimate of the state is wrong, EKF may diverge quickly owing to its linearization.

In this paper, we address the problem that the filter diverges when the camera turns
abruptly. We are inspired by the conditional filters [15] first proposed for point
tracking. The proposed CLAM also includes a condition with respect to image data:
the camera state is predicted from image data, which is much more reliable
information than the previous knowledge. Monocular SLAM [9] faces the problem of
scale, so it needs additional sensors or some a priori knowledge. The proposed CLAM
uses a stereo camera, which provides scale through its baseline. Close points, which
present a large disparity in the stereo image, are initialized as 3D points that
provide distance and orientation information.

The paper is organized as follows. In Section 2, we introduce the conditional filter.
Section 3 gives the details of the CLAM system. Section 4 provides the experimental
results. In Section 5, we draw conclusions and discuss future work.


2  Conditional  Filter 


In [15], a conditional linear filter and a conditional nonlinear filter are derived based
on the Kalman filter and particle filters, respectively. For our case, we propose another
conditional filter as an extension of the EKF with respect to image data.

Let $I_k$ denote an image obtained at time $k$. The sequence of images
$\{I_0, \ldots, I_k\}$ will be represented by $I_{0:k}$. The nonlinear image-based
filtering problem can be represented by the following system:

$$x_k = f_k^{I_{0:k}}(x_{k-1}) + w_k^{I_{0:k}} \tag{1}$$

$$z_k = h_k^{I_{0:k}}(x_k) + v_k^{I_{0:k}} \tag{2}$$

The index $I_{0:k}$ indicates a dependence on the image data. Note that $x_k$ is the
system state and $z_k$ is the measurement at time $k$. Functions $f_k^{I_{0:k}}$ and
$h_k^{I_{0:k}}$ may be estimated from $I_{0:k}$. Variables $w_k^{I_{0:k}}$ and
$v_k^{I_{0:k}}$, which are process noise and measurement noise, respectively, are
zero-mean independent white noises with covariances $Q_k^{I_{0:k}}$ and
$R_k^{I_{0:k}}$, respectively.

A  condition  is  included  with  respect  to  image  data;  thus  the  equations  for  the 
optimal  filter  can  be  applied  to  the  proposed  model. 

The  state  is  assumed  to  be  a  Gaussian  vector.  The  Gaussian  probability  density 
function  (pdf)  is  completely  characterized  by  the  mean  and  covariance  matrix.  The 
filter  can  be  represented  by  a  recursive  process  including  prediction  and  update 
phases.  The  process  goes  like  this: 


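Step 1. Initialization of $x_0$ and $P_0$, where $x_0$ is the filter state and $P_0$
is the associated state covariance.

Step 2. Estimation of the functions and matrices $f_k^{I_{0:k}}$, $h_k^{I_{0:k}}$,
$Q_k^{I_{0:k}}$, and $R_k^{I_{0:k}}$ from the image sequence.

Step 3. Prediction:

$$x_{k|k-1} = f_k^{I_{0:k}}(x_{k-1|k-1})$$

$$P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k^{I_{0:k}}$$

where $F_k = \partial f_k^{I_{0:k}} / \partial x$, evaluated at $x_{k-1|k-1}$.

Step 4. Compute $\hat{z}_k$.

Step 5. Update:

$$K_k = P_{k|k-1} H_k^T \left( H_k P_{k|k-1} H_k^T + R_k^{I_{0:k}} \right)^{-1}$$

$$x_{k|k} = x_{k|k-1} + K_k (z_k - \hat{z}_k)$$

where $H_k = \partial h_k^{I_{0:k}} / \partial x$, evaluated at $x_{k|k-1}$.

Step 6. Repeat Steps 2-5 until no unprocessed image remains.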


$\hat{z}_k$ is the predicted measurement derived from $x_{k|k-1}$ according to

$$\hat{z}_k = h_k^{I_{0:k}}(x_{k|k-1}) \tag{3}$$

$z_k$ is the measurement obtained from image $I_k$. It is always computed by a
matching technique.
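
The prediction and update steps above can be sketched in a few lines. This is only a minimal illustration, assuming NumPy; the function name `conditional_ekf_step` and the callables `f`, `F_jac`, `h`, and `H_jac` are hypothetical stand-ins for the image-dependent functions and Jacobians estimated in Step 2, and the covariance update on the last line is the standard EKF form, which the text does not state explicitly.

```python
import numpy as np

def conditional_ekf_step(x, P, z, f, F_jac, h, H_jac, Q, R):
    """One predict/update cycle of an EKF whose models come from image data.

    f, h         : process and measurement functions for this step (assumed given)
    F_jac, H_jac : callables returning the Jacobians of f and h at a state
    Q, R         : process and measurement noise covariances for this step
    """
    # Step 3: prediction of state and covariance
    F = F_jac(x)
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q

    # Step 4: predicted measurement
    z_pred = h(x_pred)

    # Step 5: Kalman gain and state update
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - z_pred)
    # Standard EKF covariance update (an assumption, not given in the text)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

In a full system, `z` would come from patch matching in the current image and the callables would be rebuilt for every frame.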


3  CLAM  System 

The difference between the SLAM and the CLAM is shown in Figure 1. Here, we
take one loop from each of the two recursive processes as an example.

Fig. 1. The difference between the SLAM and the CLAM. (a) The process of the SLAM.
(b) The process of the CLAM.

For the SLAM, $x_{k+1|k}$ is derived from $x_k$ according to the linear and angular
velocity of the camera, which may not be precise in certain conditions such as abrupt
turning or frame loss: the velocity used to predict the state at time $k+1$ is
actually the velocity of the camera at time $k$. In the CLAM, by contrast,
$x_{k+1|k}$ is computed from $x_k$ and $I_{k:k+1}$; that is, the change of speed and
direction of the camera is computed from $I_k$ and $I_{k+1}$.

3.1  The  State  Vector 

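The state of the CLAM, $X$, is composed of the camera state $x_c$ and the feature
states $x_{F_i}$. As a matter of fact, the result is represented by all the states of
the CLAM system:

$$X = \begin{bmatrix} x_c \\ x_{F_1} \\ x_{F_2} \\ \vdots \end{bmatrix} \tag{4}$$

The associated state covariance is

$$P = \begin{bmatrix} P_{cc} & P_{cF_1} & P_{cF_2} & \cdots \\ P_{F_1c} & P_{F_1F_1} & P_{F_1F_2} & \cdots \\ P_{F_2c} & P_{F_2F_1} & P_{F_2F_2} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \tag{5}$$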


The stereo camera is described by the position of its optical center $r$ and its
orientation in Euler angles $\varphi$:

$$x_c = \begin{bmatrix} r^W \\ \varphi^W \end{bmatrix} \tag{6}$$


For feature states, the inverse depth parameterization [10] used in monocular SLAM has
been shown to represent the distribution of features at infinity as well as close
points, allowing undelayed initialization of features. Despite these properties, each
inverse depth point needs an over-parameterization of six values instead of a simpler
three-coordinate spatial representation, which produces a computational overhead. Here,
since we work with a stereo camera, which can estimate the depths of points, the
feature point is defined in terms of Euclidean coordinates:


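$$x_F = \begin{bmatrix} \dfrac{b(u_r - u_0)}{u_l - u_r} & \dfrac{b(v_r - v_0)}{u_l - u_r} & \dfrac{bf}{u_l - u_r} \end{bmatrix}^T \tag{7}$$

where $b$ is the baseline of the stereo camera, $f$ is the focal length of the camera,
$(u_0, v_0)$ is the image center, and $(u_l, v_l)$ and $(u_r, v_r)$ are the image
coordinates on the left and right images, respectively.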
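
As a worked illustration of Eq. (7), the sketch below triangulates one point from a rectified stereo pair. It is a minimal example under stated assumptions: the function name `triangulate` is hypothetical, the pixel coordinates are taken in whichever image serves as the reference, and the focal length is expressed in pixels.

```python
def triangulate(u_ref, v_ref, disparity, b, f, u0, v0):
    """Return (X, Y, Z) in the camera frame for a rectified stereo pair.

    u_ref, v_ref : pixel coordinates in the reference image
    disparity    : u_left - u_right in pixels (must be positive)
    b            : baseline in metres
    f            : focal length in pixels
    u0, v0       : image center in pixels
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    x = b * (u_ref - u0) / disparity
    y = b * (v_ref - v0) / disparity
    z = b * f / disparity
    return x, y, z


# Hypothetical numbers: 12 cm baseline, 400 px focal length, 8 px disparity.
point = triangulate(360.0, 240.0, 8.0, b=0.12, f=400.0, u0=320.0, v0=240.0)
```

With these numbers the depth is 0.12 * 400 / 8 = 6 m, which shows why small disparities (distant points) give poorly conditioned depth.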


3.2  Prediction  of  the  State 


In this step, the state is estimated from the current and previous images.

First, for both the left and right images, the image coordinates of the feature point
at time $k$ are estimated from $I_{k-1:k}$ according to the robust parametric motion
estimation approach [16]:

$$(u_{lk}, v_{lk})^T = (u_{l(k-1)}, v_{l(k-1)})^T + W((u_{l(k-1)}, v_{l(k-1)})^T)\,\theta_{lk} + \omega_{lk} \tag{8}$$

$$(u_{rk}, v_{rk})^T = (u_{r(k-1)}, v_{r(k-1)})^T + W((u_{r(k-1)}, v_{r(k-1)})^T)\,\theta_{rk} + \omega_{rk} \tag{9}$$

where

$$W((u, v)^T) = \begin{bmatrix} 1 & u & v & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & u & v \end{bmatrix},$$

$(u_{lk}, v_{lk})^T$ and $(u_{rk}, v_{rk})^T$ are the coordinates of the same feature
point on the left and right images at time $k$, $\theta_k$ is the parameter vector
that contains the polynomial's coefficients, and $\omega_k$ is assumed to be white
noise with zero mean and a covariance that depends on the image data.

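Second, from the estimated positions of the feature points on both images at times
$k-1$ and $k$, the position and direction of each feature can be derived with
respect to the camera:

$$x_{i(k-1)} = \begin{bmatrix} \dfrac{b(u_{r(k-1)} - u_0)}{u_{l(k-1)} - u_{r(k-1)}} & \dfrac{b(v_{r(k-1)} - v_0)}{u_{l(k-1)} - u_{r(k-1)}} & \dfrac{bf}{u_{l(k-1)} - u_{r(k-1)}} \end{bmatrix}^T \tag{10}$$

$$x_{ik} = \begin{bmatrix} \dfrac{b(u_{rk} - u_0)}{u_{lk} - u_{rk}} & \dfrac{b(v_{rk} - v_0)}{u_{lk} - u_{rk}} & \dfrac{bf}{u_{lk} - u_{rk}} \end{bmatrix}^T \tag{11}$$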

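Using the two sets of corresponding points, $\{x_{i(k-1)}\}$ and $\{x_{ik}\}$, the
translation and rotation of the camera can be computed using the method in [17]. By
defining

$$\bar{x}_{k-1} = \frac{1}{n}\sum_{i=1}^{n} x_{i(k-1)}, \quad x'_{i(k-1)} = x_{i(k-1)} - \bar{x}_{k-1}, \quad \bar{x}_{k} = \frac{1}{n}\sum_{i=1}^{n} x_{ik}, \quad x'_{ik} = x_{ik} - \bar{x}_{k} \tag{12}$$

a correlation matrix $H$ is defined by:

$$H = \sum_{i=1}^{n} x'_{ik}\, {x'_{i(k-1)}}^T \tag{13}$$

The singular value decomposition of $H$ is given by $H = U \Lambda V^T$; the optimal
rotation matrix $R$ can be derived by $R = V U^T$. The optimal translation of the
camera $T$ can be calculated by $T = \bar{x}_{k-1} - R\,\bar{x}_{k}$.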
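
The SVD-based alignment of two 3D point sets can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `rigid_transform` is hypothetical, and the determinant check that guards against a reflection is a common safeguard added here rather than something stated in the text.

```python
import numpy as np

def rigid_transform(points_k, points_km1):
    """Least-squares R, T such that points_km1[i] ~= R @ points_k[i] + T.

    Both inputs are (n, 3) arrays of corresponding 3D points, n >= 3.
    """
    a = np.asarray(points_k, dtype=float)
    b = np.asarray(points_km1, dtype=float)
    mean_a = a.mean(axis=0)
    mean_b = b.mean(axis=0)
    a_c = a - mean_a                 # centered points at time k
    b_c = b - mean_b                 # centered points at time k-1
    H = a_c.T @ b_c                  # 3x3 sum of outer products
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:         # reflection: flip the last singular direction
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    T = mean_b - R @ mean_a
    return R, T
```

Because every feature contributes to the same `R` and `T`, a few bad matches can bias the result, so outlier rejection before this step is usually worthwhile.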


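The predicted camera state is then composed of the predicted position and orientation:

$$x_{c,k|k-1} = \begin{bmatrix} r^W_{k|k-1} \\ \varphi^W_{k|k-1} \end{bmatrix} \tag{14}$$

Because the feature points are static with respect to the 3D map, they are predicted
to be the same as in the previous state.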


3.3  Feature  Point  Selection  and  Management 

A good feature point selection algorithm is important for the whole system. A point is
considered to be tracked reliably if its neighborhood defines a luminance pattern that
carries enough information. To discard areas with insufficient luminance gradient, we
use the selection criterion proposed in [18] (see Figure 2). This criterion is based on
the eigenvalues of the structure tensor $T$:

$$T((u,v)^T) = \int_{W(u,v)} \begin{bmatrix} \nabla I_u^2 & \nabla I_u \nabla I_v \\ \nabla I_u \nabla I_v & \nabla I_v^2 \end{bmatrix} \tag{15}$$

where $[\nabla I_u, \nabla I_v] = [\partial I/\partial u, \partial I/\partial v]$, and
the two eigenvalues $\lambda_1$ and $\lambda_2$ give information on the intensity
profile within the neighborhood $W(u,v)$. Small eigenvalues are associated with a
constant intensity profile, whereas large values indicate a luminance pattern that can
be successfully tracked. The corresponding feature is therefore accepted if
$\min(\lambda_1, \lambda_2) > \lambda$, where $\lambda$ is a threshold. The
corresponding 11x11 patch with the feature point as its center is stored for
measurement detection.
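
The minimum-eigenvalue criterion can be sketched as below. This is a minimal illustration assuming NumPy and a grayscale image stored as a 2D array; the function names, the finite-difference gradients, and the zero-padded square window are choices made for the example, not details taken from [18].

```python
import numpy as np

def window_sum(x, h):
    """Sum x over a (2h+1) x (2h+1) window around every pixel (zero padding)."""
    padded = np.pad(x, h, mode="constant")
    rows, cols = x.shape
    out = np.zeros_like(x)
    for dv in range(2 * h + 1):
        for du in range(2 * h + 1):
            out += padded[dv:dv + rows, du:du + cols]
    return out

def min_eigenvalue_map(image, half_window=1):
    """Smaller eigenvalue of the 2x2 structure tensor at every pixel."""
    img = np.asarray(image, dtype=float)
    gv, gu = np.gradient(img)            # derivatives along rows (v) and columns (u)
    a = window_sum(gu * gu, half_window)
    b = window_sum(gu * gv, half_window)
    c = window_sum(gv * gv, half_window)
    # Closed-form smaller eigenvalue of [[a, b], [b, c]]
    return 0.5 * ((a + c) - np.sqrt((a - c) ** 2 + 4.0 * b ** 2))

def select_features(image, threshold, half_window=1):
    """Return (row, col) positions whose minimum eigenvalue exceeds the threshold."""
    lam = min_eigenvalue_map(image, half_window)
    return [tuple(p) for p in np.argwhere(lam > threshold)]
```

A straight edge gives one large and one near-zero eigenvalue and is rejected, while a corner gives two large eigenvalues and is kept.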


Fig. 2. Feature points detected using a stereo camera


At each step, we use only those features that fall in the field of view of both the left
and right cameras. These features are then projected onto the left and right images. A
matching search based on normalized cross-correlation is performed using the patch
associated with each feature. When insufficient features are visible, new features are
added into the state. Moreover, non-persistent features are deleted from the state
vector to avoid unnecessary growth of the feature population.
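
A matching search based on normalized cross-correlation can be sketched as follows. This is a minimal illustration assuming NumPy; the function names `ncc` and `best_match`, the exhaustive scan, and the search radius parameter are hypothetical choices for the example.

```python
import numpy as np

def ncc(patch, candidate):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    p = np.asarray(patch, dtype=float)
    c = np.asarray(candidate, dtype=float)
    p = p - p.mean()
    c = c - c.mean()
    denom = np.sqrt((p * p).sum() * (c * c).sum())
    if denom == 0.0:                      # at least one patch is constant
        return 0.0
    return float((p * c).sum() / denom)

def best_match(image, patch, top_left_guess, radius):
    """Scan top-left positions within `radius` of the guess; return (position, score)."""
    ph, pw = patch.shape
    rows, cols = image.shape
    r0, c0 = top_left_guess
    best_pos, best_score = None, -2.0
    for r in range(max(0, r0 - radius), min(rows - ph, r0 + radius) + 1):
        for c in range(max(0, c0 - radius), min(cols - pw, c0 + radius) + 1):
            score = ncc(patch, image[r:r + ph, c:c + pw])
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score
```

Subtracting the mean and dividing by the norms makes the score insensitive to uniform brightness and contrast changes between frames.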

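When a new feature $x_n$ is added into the state, the state vector and covariance are
augmented:

$$X \leftarrow \begin{bmatrix} X \\ x_n \end{bmatrix}, \qquad P \leftarrow \begin{bmatrix} P & P_{Xn} \\ P_{nX} & P_{nn} \end{bmatrix} \tag{16}$$

When a feature $x_i$ is deleted from the state, its entry is removed from the state
vector, and the corresponding rows and columns are removed from the covariance:

$$X = \begin{bmatrix} X_a \\ x_i \\ X_b \end{bmatrix} \rightarrow \begin{bmatrix} X_a \\ X_b \end{bmatrix}, \qquad P = \begin{bmatrix} P_{aa} & P_{ai} & P_{ab} \\ P_{ia} & P_{ii} & P_{ib} \\ P_{ba} & P_{bi} & P_{bb} \end{bmatrix} \rightarrow \begin{bmatrix} P_{aa} & P_{ab} \\ P_{ba} & P_{bb} \end{bmatrix} \tag{17}$$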

3.4  Measurement  Equation

At each step, we project every 3D feature point onto the left image. A match is detected
by performing normalized cross-correlation. The resulting new measurement $z$ is then
used to update the state of the filter.

First, using the estimates of the camera position $r_{k|k-1}$ and the feature position
$x_i$, the position of the feature point relative to the camera is expected to be:


(18) 


where $R$ is the rotation matrix and $r_l$ is the position of the left camera.


The position at which the feature point $x_i$ would be found in the left image is
calculated according to the standard pinhole model:


(19)


where $f$ is the focal length of the camera, $(u_0, v_0)$ is the principal point, and
$s_u$ and $s_v$ are the camera calibration parameters.

4  Experiments 

To demonstrate the robustness of the proposed CLAM system, we captured one short video
at a frame rate of 20 fps with a Bumblebee2 stereo camera (see Figure 3).

Fig.  3.  Bumblebee2  stereo  camera 


The stereo camera provides a 100x84 degree FOV per camera and has a baseline of 12 cm.
Features that are less than 7 m away from the camera are initialized as 3D points. With
the help of the Triclops SDK supplied with the stereo camera, the captured video is
rectified, so lens distortion is not considered in the CLAM system.

The video is captured with the camera held in hand. It is processed in Matlab with
the proposed algorithm on a laptop with an Intel 4 processor at 1.8 GHz and 1 GB of
memory.


Fig. 4. Two frames of the video clip and the results of the CLAM. (a) Frame #2. (b) Frame #122.

Figure 4 illustrates the results of the proposed CLAM. To show that the CLAM is robust
to frame loss and abrupt turning of the camera, frames #20 to #40 are not used in the
following experiment, which means that frame #41 is processed right after frame #19.
The results are shown in Figure 5. The CLAM still gives a correct estimate even when
some frames are lost.


Fig. 5. Results of the CLAM in the case of frame loss. (a) Frame #2. (b) Frame #122.


To compare CLAM using a stereo camera with SLAM using a monocular camera, we take all
the left images from the video clip to run SLAM; at the first step, the stereo images
are used to supply the SLAM with scale information. Table 1 compares the camera
positions estimated by the two methods: CLAM with a stereo camera and monocular SLAM.
From the results, we can see that CLAM performs better; its trajectory is close to the
ground truth, especially during an abrupt turn or frame loss. The SLAM deviated from
the ground truth when it processed frame #41 because of the incorrect velocity
information of the camera used.

Table 1. The comparison of the camera's position calculated by CLAM and SLAM in 3 steps

Frame  Ground truth (m)     Method  Estimated (m)
       X     Y     Z                X            Y            Z
#1     0.00  0.00  0.00     CLAM    0.00 ± 0.01  0.00 ± 0.01  0.00 ± 0.01
                            SLAM    0.00 ± 0.01  0.00 ± 0.01  0.00 ± 0.01
#19    0.51  0.04  0.50     CLAM    0.50 ± 0.02  0.03 ± 0.02  0.50 ± 0.02
                            SLAM    0.52 ± 0.03  0.04 ± 0.03  0.50 ± 0.02
#41    1.41  0.02  0.04     CLAM    1.40 ± 0.02  0.02 ± 0.03  0.04 ± 0.02
                            SLAM    0.53 ± 0.5   0.03 ± 0.4   0.51 ± 0.5


5  Conclusions  and  Future  Work 

A significant contribution of this paper is the introduction of a new localization and
mapping method called CLAM. Compared with the traditional SLAM approach, the
framework of the CLAM is an image-sequence-based conditional filter that is robust to
occlusion and abrupt changes. Using a robust parametric estimation technique, the
velocity of the features of interest can be derived to estimate the motion of the
camera. Normalized cross-correlation is used to find the measurement.

In this research, a stereo camera is used as the only sensor. Nearby features are easy
to initialize and provide scale information for the 3D map. The close features also
provide distance and orientation information.

Currently, the CLAM has been applied to a short video. In the future, we will run the
CLAM over a long distance and improve it so that it is robust for loop detection.

As an extension of SLAM, CLAM currently focuses on building 3D maps of unknown
environments with feature points, which can be easily detected and identified but are
not robust against occlusion and illumination changes. In indoor environments, many
line features are available, such as the edges of walls, tables, etc. Lines have
various advantages over points. First, lines are insensitive to illumination and
occlusion. Second, maps of line segments visualize the spatial structures of
environments. Third, line matching can be achieved even when viewpoint changes occur,
whereas point features can only be reliably matched over a narrow range of
viewpoints. In the future, we will aim to build a map with line segments.

Abstract. Most previous works on web news article extraction focus only on content
and title. To meet the growing demand of various web data integration applications,
more useful news attributes, such as publication date, author, etc., need to be
extracted and stored in structured form for further processing. In this paper, we
study the problem of automatically extracting multiple news attributes from news
pages. Unlike the traditional ways (e.g., extracting news attributes separately or
generating template-dependent wrappers), we propose an automatic, unified approach to
extract them based on the visual features of news attributes, which include
independent visual features and dependent visual features. The basic idea of our
approach is that, first, the candidates of each news attribute are extracted from the
news page based on their independent visual features, and then the true value of each
attribute is identified from the candidates based on dependent visual features (the
layout relations among news attributes). Extensive experiments using a large number of
news pages show that the proposed approach is highly effective and efficient.

Keywords:  web  data  extraction,  news  attribute,  visual  feature. 


1  Introduction 

As one of the most popular web information sources, web news articles attract countless
surfers every day. Meanwhile, many important applications need an efficient way to
extract news articles from web pages at fine granularity instead of indexing whole
pages. Fig. 1 shows an example of a news article, with the attributes of the article
marked with red boxes.

Extracting news articles from web pages automatically is a very challenging task
because of the various layouts and templates of news web pages. To the best of our
knowledge, though some efforts [1, 2, 13] have been made on this task, most of them
focus only on extracting content and title. In fact, more attributes (publication date,
author, etc.) also need to be extracted to meet the growing demand of various
applications. Table 1 illustrates the functions of the news attributes except title and
content (because their functions are widely known).

In this paper we focus on 8 important news attributes: title, author, publication date,
content, category, source, related news links, and comment link. Though title and
content can be extracted with good performance using appropriate features (e.g., HTML
tag, font size, text length, etc.) in previous works [1, 13], most attributes cannot be
extracted in a similar way. For example, the publication date would be difficult to
identify with only its own features if many dates appear in the news page. However, a
user can still identify it without any difficulty according to the layout relations
between it and other attributes.

In fact, when people browse a web page, they are subconsciously guided by the
experience they have accumulated in browsing similar web pages. Therefore, to ease
users' consumption, news page designers always give careful consideration to the
visual features of news attributes, i.e., what type of font should be used and where
each attribute should be placed in the page. Our approach simulates how a user
understands news attributes in news pages based on visual perception. In this paper,
the visual features of news attributes used in our approach are classified into two
types: independent visual features and dependent visual features. Independent visual
features are used to identify news attributes independently, and include font, text
length, etc. Dependent visual features characterize the layout relations among news
attributes on web pages, and include the direction feature and the neighbor feature.
Section 2 introduces these visual features.

Based on the visual features, we propose a new unified approach to extract news
attributes, which differs from traditional ways that extract each attribute
independently or generate template-dependent wrappers. The proposed approach consists
of two stages. First, several candidates of each attribute are extracted from the news
page based on their independent visual features. Next, the true value of each attribute
is identified from its candidates based on the dependent visual features. A prototype
system, VEWNO, has been implemented based on the proposed approach. Though this paper
focuses on 8 attributes, our approach is general and applies to the extraction task of
any attribute set.


[Screenshot of a news page; the marked attributes are category, source, publication
date, title, author, content, related links, and comment link.]

Fig. 1. An example of web news article

Table 1. The functions of news attributes

News attribute      Function
category            classifying news articles by topic
related links       crawling event-oriented news articles
comment link        crawling comment pages related to this news article
publication date    ordering the news articles about the same event according to time
author, source      evaluating the credibility of a news article

Overall, the contributions of this paper are summarized in four aspects. First, we
investigate the visual features of news attributes (especially dependent visual
features). To the best of our knowledge, the existing approaches mainly focus on some
of the independent visual features. Second, we propose a unified approach to extract
multiple attributes of web news articles. Intuitively, extracting more attributes
brings more challenges. But we think such challenges are also opportunities, because
more informative evidence can be used to improve the extraction performance. In other
words, the extraction performance of an attribute can be improved by the layout
relations between it and other attributes, and vice versa. Third, our approach also
combines web data extraction and annotation. That is, the attributes have been assigned
the right semantics when they are extracted. It is widely known that web data
annotation is a very challenging task [12]. Fourth, the basic idea of our approach is
not limited to the extraction task of news articles. It is a promising way to extract
multiple attributes of a web object simultaneously from noise-rich web pages. Many
structured web objects, such as blogs and products, can also be extracted in the same
way. As future work, our approach will be applied to more web objects.

The rest of the paper is organized as follows. In Section 2, we introduce the visual
features used in our approach. Sections 3 and 4 discuss the underlying techniques of
our approach, attribute candidate extraction and true value identification,
respectively. In Section 5, we present and analyze the experimental results. The
related work is introduced in Section 6, and Section 7 is the summary.

2  Visual  Features 

Web pages are special documents accessed through a web browser. Visual information is
the most important clue that helps people understand the semantic structures of web
pages. As a result, news attributes can also be identified with appropriate visual
features. The visual features of news attributes include independent features and
dependent features. Fig. 2 shows the category of the visual features.

Fig. 2. The category of visual features

Independent  features:  For  each  news  attribute,  some  visual  features  are  very  useful  to 
identify  it  in  news  pages.  For  example,  title  always  uses  a  notable  font  compared  to 
other  texts  on  web  pages.  We  call  these  features  independent  features  in  this  paper. 
Three  kinds  of  independent  features  are  listed  as  follows. 

•  Font  features:  size,  bold,  style,  color; 

•  Size & position features: height, width, coordinate (x, y)¹;

•  Text  features:  text  length,  link  text  length,  frequent  words,  expression  format 
(such  as  date). 

Further, more advanced features can be derived from the basic features above. For
example, the area of a text block can be calculated from its width and height, and the
ratio of link text can be calculated from the text length and the link text length. We
do not list all the features because of the limit on paper length.

Dependent features: We introduce dependent features to represent the layout relations
among news attributes on web pages. According to our observations, the layout relations
of news attributes are not chaotic, even though the templates of news pages vary.
We classify such regularities into the direction feature and the neighbor feature.

Direction feature indicates the direction relation among attributes. Since a web page
is laid out in two dimensions, we use “top-down” and “left-right” to represent this
feature. The direction relation of two attributes a1 and a2 is defined as below.

Top-down: a1.y < a2.y;

Left-right: a1.y = a2.y and a1.x < a2.x.

These relations can be deduced easily from their coordinates. According to the
definition, the “top-down” relation takes precedence over the “left-right” relation, so
it is impossible for a1 to be both on top of and on the left of a2 in one news article.
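
The two direction relations translate directly into a small function. The sketch below is illustrative only; the `Block` type and the function name `direction_relation` are hypothetical, and coordinates follow the convention that y grows downward from the top-left corner of the page.

```python
from typing import NamedTuple, Optional

class Block(NamedTuple):
    x: float  # horizontal coordinate of the block's top-left corner
    y: float  # vertical coordinate of the block's top-left corner

def direction_relation(a1: Block, a2: Block) -> Optional[str]:
    """Return the direction relation of a1 with respect to a2, if any."""
    if a1.y < a2.y:
        return "top-down"        # a1 lies above a2
    if a1.y == a2.y and a1.x < a2.x:
        return "left-right"      # same row, a1 lies to the left of a2
    return None                  # neither relation holds in this order

title = Block(x=40, y=100)
author = Block(x=40, y=160)
source = Block(x=200, y=160)
relation = direction_relation(author, source)
```

Because the top-down test is evaluated first, a pair can never receive both labels, which matches the precedence rule stated above.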

Neighbor feature represents the neighbor relations among attributes. For example,
author and content are often neighbors on the page. a1 and a2 are defined to be in the
neighbor relation iff no other attributes appear between them in the horizontal or
vertical direction. Note that noise text does not influence the neighbor relation. For
example, in Fig. 1, the related news links and the comment link are still regarded as
neighbors even though some text is inserted between them.


[Layout relation matrix with rows and columns labeled title, author, content, ...; for
example, the (title, content) cell holds {1, 0, 38.5%}.]

Fig.  3.  The  model  of  layout  relation  matrix 


¹ The origin is the top-left corner of a web page, and (x, y) is the top-left corner of the text block.

We use a layout relation matrix to represent the layout relations between any two
attributes. Fig. 3 shows the model of the layout relation matrix. Each cell in it is
denoted in the form of a triple {p_t, p_l, p_n}, where p_t, p_l and p_n are the
probabilities for the top-down, left-right and neighbor relations, respectively. For
example, the cell {1, 0, 38.5%} means that the probabilities of title and content on
the three layout relations are 1, 0 and 38.5%. The layout relation matrix can be
produced using labeled news pages. We observe that the probabilities in the layout
relation matrix converge when the number of news pages is large enough.
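
One way the matrix could be produced from labeled pages is by counting how often each relation holds for each attribute pair. The sketch below assumes that the probabilities are plain relative frequencies, which the text does not state; the function name and the input format (one dict per labeled page) are hypothetical.

```python
from collections import defaultdict

def layout_relation_matrix(pages):
    """Estimate {p_t, p_l, p_n} for each ordered attribute pair.

    pages: list of dicts, one per labeled page, mapping an ordered pair
           (a1, a2) to a triple of booleans (top_down, left_right, neighbor).
    Returns a dict mapping each pair to a triple of relative frequencies.
    """
    counts = defaultdict(lambda: [0, 0, 0])
    totals = defaultdict(int)
    for page in pages:
        for pair, relations in page.items():
            totals[pair] += 1
            for i, holds in enumerate(relations):
                counts[pair][i] += int(bool(holds))
    return {pair: tuple(c / totals[pair] for c in counts[pair]) for pair in totals}

# Two hypothetical labeled pages
pages = [
    {("title", "content"): (True, False, True)},
    {("title", "content"): (True, False, False), ("author", "content"): (True, False, True)},
]
matrix = layout_relation_matrix(pages)
```

Under this reading, the convergence noted above simply reflects relative frequencies stabilizing as more labeled pages are counted.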

3  Attribute  Candidate  Extraction 

Attribute candidate extraction aims to extract some text blocks from the news page as
the candidates of each news attribute and to ensure that the true value is one of
them. In our implementation, a news page is partitioned into a set of text blocks. Each
text block occupies a rectangular area on the page, and the visual information (font,
coordinate, etc.) is attached to the text blocks during this process. We adopt the VIPS
algorithm [15] to build the visual block tree for a news page and collect the leaf
nodes as the initial text blocks. To ensure that attributes and text blocks are in 1:1
mappings, we merge text blocks into one block if they share the same font, are
adjoining on the page, and are not separated by recognized separators (such as "/") [7].

Table 2. The rules for attribute candidate extraction

Attribute           Extraction rules
title               45px > Font-size > 15px; Font-color: black or blue;
                    y < page.height/2; y < screen.height; 8 < Text-length < 50
source              Font-size < 12px; Font-color: black, grey or brown;
                    Frequent-word: "from", etc.; 4 < Text-length < 25; isAnchorText: no
content             6px < Font-size < 12px; Font-color: black; y < screen.height;
                    Text-length > 20
publication date    Font-size < 10px; Font-color: black, blue or grey; Text-length < 16;
                    Text-format: date regular expressions; isAnchorText: no
category            Font-size < 12px; y < page.height/2; y < screen.height;
                    8 < Text-length < 30; Frequent-word: ">", "|", etc.
author              Font-size < 12px; 3 < Text-length < 25;
                    Frequent-word: "author:", "By", etc.
related news links  Font-size < 12px; Font-color: black or blue; y > page.height/2;
                    isAnchorText: yes; Frequent-word: "related news", "related links", etc.
comment link        Font-size < 12px; 6 < Text-length < 15; Frequent-word: "comment";
                    isAnchorText: yes

3.1  Candidate  Extraction

For each attribute, some text blocks are extracted as its candidates using several simple heuristic rules based on independent features. The candidate extraction rules are those shown in Table 2. If a text block satisfies all the rules of an attribute, it is regarded as a candidate of this attribute. In this way, a group of text blocks is extracted as the candidates for each attribute. We now introduce a general, automatic way of training candidate extraction rules from labeled news pages.
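
The filtering step above can be sketched as follows. This is a minimal illustration rather than the VEWNO implementation: the `TextBlock` fields, the `title_rules` helper and the colour and position thresholds are assumptions chosen for the example; only the title font-size range (15px to 45px) is quoted from the text.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    font_size: float   # in px
    font_color: str
    y: float           # top coordinate of the block on the page, in px
    is_anchor: bool

def title_rules(page_height):
    """Hypothetical rule set for the title attribute (illustrative values)."""
    return [
        lambda b: 15 < b.font_size < 45,              # range quoted in the text
        lambda b: b.font_color in {"black", "blue"},  # assumed for the example
        lambda b: b.y < page_height / 2,              # assumed for the example
    ]

def extract_candidates(blocks, rules):
    """A block becomes a candidate only if it satisfies every rule."""
    return [b for b in blocks if all(rule(b) for rule in rules)]
```

Because every rule must hold, adding a rule can only shrink the candidate set, which is why the rules have to be loose enough to keep the true value.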

3.2  Training  Candidate  Extraction  Rules 

To build these candidate extractors, two questions have to be answered: given a news attribute, which visual features should be selected to generate candidate extraction rules, and how should appropriate values be set for the rules?

Visual  feature  selection 

For each news attribute, some useful visual features are selected to generate candidate extraction rules. For example, font size is a very effective feature for detecting the title of a news page. Manually selecting visual features is time-consuming and error-prone. The classic C4.5 algorithm is employed for this task because it can select appropriate features and use them to build a classification tree. The training set consists of the text blocks obtained from news pages in the web page representation step. In the training set, the true values are labeled as positive samples and all other blocks are labeled as negative samples. Once the classification tree for a news attribute has been built with C4.5, the features that appear in the tree are the selected features.
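
To make the selection criterion concrete, the sketch below computes the gain ratio that C4.5 uses to rank features. It is a simplified illustration under stated assumptions: features are already discretized (C4.5 itself also handles continuous features by thresholding), and the function names and the dictionary-per-block representation are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, labels, feature):
    """C4.5-style gain ratio of one discrete feature."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Expected entropy after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = -sum(len(g) / total * math.log2(len(g) / total)
                      for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

def rank_features(rows, labels):
    """Features ordered from most to least informative."""
    return sorted(rows[0].keys(),
                  key=lambda f: gain_ratio(rows, labels, f), reverse=True)
```

A feature that separates the true values from the other blocks gets a ratio near 1, while an uninformative one gets a ratio near 0 and would not appear in the tree.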


Fig. 4. An example to illustrate the neighbor relation


Candidate  extraction  rule  generation 

To ensure that the true value satisfies every rule of the attribute, the value domain of a selected feature is the union of the values that the true values of this attribute take on this feature in the training set. For example, the title candidate extraction rule "45px > Font-size > 15px" means that the font size of titles ranges from 15px to 45px in the training set. When the training set is large enough, we believe the candidate extraction rules shown in Table 2 are safe.
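
A minimal sketch of this rule generation follows, assuming each feature is either numeric (summarized as a closed range) or categorical (summarized as a set of observed values). The tuple encoding of a rule and the names `learn_rule` and `satisfies` are illustrative choices, not taken from the paper.

```python
def learn_rule(true_values):
    """Derive the value domain of one feature from the labeled true values."""
    if all(isinstance(v, (int, float)) for v in true_values):
        # Numeric feature: the smallest range covering every true value.
        return ("range", min(true_values), max(true_values))
    # Categorical feature: the set of values seen on true values.
    return ("set", frozenset(true_values))

def satisfies(rule, value):
    """Check one feature value against a learned rule."""
    if rule[0] == "range":
        return rule[1] <= value <= rule[2]
    return value in rule[1]
```

By construction every training true value satisfies its own rule, which is the safety property the paragraph above relies on; unseen pages are covered only as far as the training set is representative.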

4 True Value Identification

The goal of true value identification is to identify the true value among the candidates of each attribute. In this section, we first introduce a method for measuring the layout reasonableness of a candidate news article, and then propose an efficient way to identify the true values. We define a candidate news article as a set of candidates taken from different attributes, and a true news article as a candidate news article in which every candidate is a true value. Intuitively, the layout of the true news article is more reasonable than that of any other candidate news article.

4.1 Measuring the Layout Reasonableness of a Candidate News Article

We define the layout reasonableness of a candidate news article as the sum of the layout reasonableness of every pair of candidates in it. We measure the layout reasonableness of a pair of candidates based on the layout relation matrix. Given any two candidates c_i and c_j belonging to different news attributes, their layout reasonableness is calculated as follows:

\varphi(c_i, c_j) = \begin{cases} \lambda_{ij} \cdot p_x(a_i, a_j) + \mu_{ij} \cdot p_n(a_i, a_j), & \text{otherwise} \\ 0, & \text{if } p_x(a_i, a_j) \cdot p_n(a_i, a_j) = 0 \end{cases} \qquad (1)

where a_i and a_j are the attributes that the candidates c_i and c_j belong to, and p_x(a_i, a_j) and p_n(a_i, a_j) are the probabilities of c_i and c_j on the direction relation and the neighbor relation, respectively, in the layout probability matrix. p_x(a_i, a_j) is an alternative probability determined by the current direction relation of c_i and c_j: if c_i and c_j satisfy the top-down relation, p_x(a_i, a_j) is p_v(a_i, a_j); otherwise it is p_l(a_i, a_j). λ_ij and μ_ij are the weights of p_x and p_n. Different attribute pairs have different λ and μ. For instance, according to our observations, the direction relation is more important than the neighbor relation for comment link and title, while the neighbor relation is more important than the direction relation for comment link and content. We choose SVM (Support Vector Machines) to obtain appropriate values for λ_ij and μ_ij automatically. The training set is a set of candidate news articles. If all the attribute candidates in a candidate news article are true values, it is labeled as a positive sample; otherwise it is labeled as a negative sample.

Algorithm of Calculating β

Input: two candidates c_i and c_j
Output: β
Begin
1  Initialize set C_x;
2  Put the candidates that are between c_i and c_j into C_x;
3  For each c_k ∈ C_x do
4      Calculate p(c_k = F) from p_x(c_i, c_k) and the candidate set CS_k;
5  β = ∏_{c_k ∈ C_x} p(c_k = F);
6  Return β;
End

Fig. 5. Algorithm of calculating β

However, it is difficult to determine whether two candidates are really neighboring when other candidates lie between them. In the example shown in Fig. 4, c_1 and c_5 are in the top-down relation, and three candidates (c_2, c_3, c_4) are between them. c_1 and c_5 are neighboring only when c_2, c_3 and c_4 are all false values; otherwise they are not. Because no candidate can be determined to be true or false yet, we use an attenuation factor β to represent the probability that two candidates are neighboring. Therefore Formula 1 is replaced by the following formula, in which β attenuates the neighbor term. The estimation of β is introduced below.

\varphi(c_i, c_j) = \begin{cases} \lambda_{ij} \cdot p_x(a_i, a_j) + \beta \cdot \mu_{ij} \cdot p_n(a_i, a_j), & \text{otherwise} \\ 0, & \text{if } p_x(a_i, a_j) \cdot p_n(a_i, a_j) = 0 \end{cases} \qquad (2)

Further, the formula to measure the layout reasonableness of a candidate news article AC is given below.

\varphi(AC) = \sum_{c_i, c_j \in AC,\ i < j} \varphi(c_i, c_j) \qquad (3)
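
The two scores can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the probabilities and weights are passed in as plain numbers, and the attenuation factor is assumed to multiply the neighbor term, which is how the text describes its role.

```python
from itertools import combinations

def pair_score(p_dir, p_nbr, lam, mu, beta=1.0):
    """Layout reasonableness of two candidates.

    p_dir, p_nbr: direction and neighbor probabilities of the attribute pair.
    lam, mu: weights of the two terms; beta: attenuation factor (assumed to
    scale the neighbor term). The score is 0 when either probability is 0.
    """
    if p_dir * p_nbr == 0:
        return 0.0
    return lam * p_dir + beta * mu * p_nbr

def article_score(candidates, score_fn):
    """Layout reasonableness of an article: sum over all candidate pairs."""
    return sum(score_fn(a, b) for a, b in combinations(candidates, 2))
```

With beta = 1 the pair score reduces to the unattenuated form of Formula 1; smaller values of beta discount the neighbor evidence when other candidates lie in between.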


Estimating β

β is the probability that c_i and c_j are neighboring in Formula 2, i.e., the probability that all candidates between c_i and c_j are false values. Suppose CS_k is the candidate set of the attribute a_k that c_k belongs to, and its size is n. Intuitively, the probability that c_k is a false value, denoted as p(c_k = F), should be 1 - (1/n). But this is not reasonable. For example, title is always above content, so any candidate title must be a false value if it is below all candidate contents. Therefore, we propose a more effective algorithm (Fig. 5) to calculate β based on the direction relation.

Line 4 is an alternative component that depends on the direction relation of c_i and c_k. For example, if c_i is above c_k, p_x(c_i, c_k) is replaced by p_v(c_i, c_k); if c_i is on the left of c_k, p_x(c_i, c_k) is replaced by p_l(c_i, c_k). Obviously, the more candidates there are between c_i and c_j, the smaller β is.
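
As a point of comparison, the naive estimate mentioned above can be written down directly. The sketch uses the 1 - 1/n false-value probability that the text sets aside as too crude, combined with the product over in-between candidates; it is not the refined direction-aware computation of Fig. 5, and the function names are illustrative.

```python
import math

def naive_false_probability(candidate_set_size):
    """1 - 1/n: chance that one candidate is false when one of n is true."""
    return 1.0 - 1.0 / candidate_set_size

def attenuation(false_probabilities):
    """Probability that every candidate lying between two others is false.

    Treats the in-between candidates as independent, so the result is the
    product of their individual false-value probabilities.
    """
    return math.prod(false_probabilities)
```

Each extra in-between candidate multiplies the result by a value below 1, which matches the observation that more candidates between the pair give a smaller attenuation factor.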

4.2  Efficient  Algorithm  for  True  Value  Identification 

A straightforward way is to enumerate all possible candidate news articles and measure the layout reasonableness of each. However, the total number of candidate news articles, |CS_1| · |CS_2| · ... · |CS_8|, is often very large. Note that if |CS_i| = 0, it is removed from the product. Because a large number of news pages are processed in real applications, it is inefficient to generate and measure all candidate news articles. To avoid this, a simple and efficient algorithm is proposed to find the true news article. The algorithm is an iterative process: first, all possible partial candidate news articles with only the two candidate sets CS_1 and CS_2 are generated; next, once the candidate articles with i-1 candidate sets have been generated, the candidate news articles with i candidate sets are generated. The iteration stops when i = 8. Finally, the optimal candidate article, the one with the highest layout reasonableness among all candidate news articles, is selected as the true news article. The details of this algorithm are shown in Fig. 6.

Algorithm for True News Article Identification

Input: candidate sets of all attributes, CS_1, CS_2, ..., CS_8.
Output: the true news article TNA.
Begin
1   Initialize ACS = ∅;  // ACS is the article candidate set
    // step 1: generate the partial article candidates with two candidate sets
2   For each c_i ∈ CS_1 and each c_j ∈ CS_2 do
3       Calculate φ(c_i, c_j) using Formula 2;
4       If φ(c_i, c_j) > 0 then
5           Put (c_i, c_j) into ACS;
    // step 2: extend the current partial article candidates in ACS by adding CS_3, ..., CS_8
6   For i = 3 to 8 do
7       For each AC_k ∈ ACS do
8           Take AC_k out of ACS;
9           For each c_j ∈ CS_i do
10              AC = AC_k;
11              Add c_j into AC;
12              If φ(AC) > α then  // α is the threshold
13                  Put AC into ACS;
    // select and output the optimal article candidate
14  TNA = {AC | φ(AC) is max, AC ∈ ACS};
15  Return TNA;
End

Fig. 6. Algorithm of true news article identification


There are two steps in this algorithm. In step 1 (lines 2-5), the partial candidate news articles (containing only two candidates) are generated from CS_1 and CS_2, and the non-zero ones are put into ACS. In step 2 (lines 6-13), once i-1 candidate sets have been processed, every candidate in CS_i is added to each partial candidate article in ACS. If the value of an extended partial candidate article is smaller than the predefined threshold α, it is removed from ACS because it has a low chance of being the optimal one. Such pruning can greatly reduce the size of ACS. At the end of the algorithm, the article candidate with the maximum value is selected as the optimal one, and the attribute candidates in it are regarded as the true values of the news attributes.
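
The two steps can be sketched as below. This is a simplified rendering under assumptions: `pair_score` is any caller-supplied function returning the pairwise layout reasonableness, candidates are arbitrary hashable values, and empty candidate sets are skipped, as the text notes for the product of set sizes.

```python
from itertools import combinations

def find_true_article(candidate_sets, pair_score, threshold):
    """Return the best-scoring article, or None if everything is pruned."""
    def score(article):
        # Article score: sum of pairwise scores over all candidate pairs.
        return sum(pair_score(a, b) for a, b in combinations(article, 2))

    # Empty candidate sets contribute nothing and are skipped.
    sets = [cs for cs in candidate_sets if cs]
    if len(sets) < 2:
        return None
    first, second, *rest = sets
    # Step 1: pair up the first two sets, keeping pairs with a non-zero score.
    partial = [(a, b) for a in first for b in second if pair_score(a, b) > 0]
    # Step 2: extend each surviving partial article with the next set and
    # drop extensions whose score does not exceed the threshold.
    for cs in rest:
        partial = [ac + (c,) for ac in partial for c in cs
                   if score(ac + (c,)) > threshold]
    return max(partial, key=score) if partial else None
```

The threshold trades completeness for speed: a value that is too high can prune the true article before it is fully assembled, so in practice it would be tuned on the training set.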

5  Experiments 

To evaluate the performance of our approach, we have implemented a prototype system, VEWNO. The input is any web news page that displays properly in a web browser, and the output is the set of news attributes embedded in this page. The current VEWNO processes one news page in 0.32 to 0.57 seconds (about 2-3 pages per second).

5.1  Experiment  Setup 

The data set used in the experiments covers 50 online news sites. We randomly collect 100 to 300 news pages from every site. We divide the data set into a training set and a test bed: the pages of 25 sites form the training set, and the pages of the other 25 sites form the test bed. The traditional measures, precision, recall and F1, are used in the experiments to evaluate VEWNO.
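
For reference, the three measures can be computed as in the sketch below. It assumes that extracted and true values are compared by exact match as sets, which is an assumption of this example; the text does not state the matching criterion.

```python
def precision_recall_f1(extracted, truth):
    """Precision, recall and F1 of extracted values against true values."""
    extracted, truth = set(extracted), set(truth)
    correct = len(extracted & truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```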

5.2  Performance  Evaluation 

We conduct an experiment to evaluate the performance of VEWNO. The training set is used to (1) learn the rules for attribute candidate extraction; (2) build the layout relation matrix; (3) train the parameters λ_ij and μ_ij; (4) estimate β.

Table 3. Extraction performance of VEWNO

            title     author    source    publication date
precision   98.2%     91.5%     94.9%     97.9%
recall      95.0%     84.3%     90.3%     96.1%
F1          96.2%     87.8%     92.5%     97.0%

            content   review    category  related news links
precision   96.4%     96.3%     97.7%     95.8%
recall      93.7%     93.1%     95.5%     92.2%
F1          95.0%     94.7%     96.6%     94.0%

Table 3 shows the experimental results of VEWNO on the test bed, and two conclusions can be drawn. First, the approach performs very well on all 8 attributes and on all three measures. Compared with the experimental results reported by [1] and [5] on content and title respectively, our approach is better or very close. The results of [13] are much better than ours because it only extracts the minimum subtree that contains the content rather than the exact content. Second, our approach is template-independent because the templates of the pages in the training set are different from those of the pages in the test bed. So VEWNO can perform the news article extraction task on news pages whose templates were not seen in training. This trait is very important for real applications.

5.3  Experiments  on  Visual  Features 

To evaluate the effectiveness of the dependent visual features, we implemented 8 extractors, which extract the 8 news attributes from news pages separately, and compared their performance with that of VEWNO. Each extractor is a classifier trained with SVM (LibSVM) based only on independent visual features (IV), whereas VEWNO can be viewed as the combination of independent visual features and dependent visual features (DV).

The experimental results on the F1 measure show that IV+DV significantly outperforms IV alone on most attributes. For some attributes (title, content and category), the performance is acceptable using only the independent visual features. But for the other attributes, the performance is very poor because their independent visual features do not distinguish them well enough from other texts in the pages. So the experiment shows that the dependent visual features we introduce are very effective in improving the performance.

Fig. 7. Comparison experiments between IV and IV+DV


5.4  Experiments  on  Comparison  with  CRF-Based  Approach 

CRF is the state-of-the-art model for specific semantic object extraction. We have implemented the CRF-based approach proposed in [16], which is close to ours, and made the comparison. The training set and testing set are the same as those of VEWNO.


Fig. 8. Comparison experiments between VEWNO and CRF-approach


From the experimental results in Fig. 8, we find that, though the CRF-based approach performs better on some attributes, it is still no match for VEWNO overall. The reason is that, though the CRF-based approach also exploits the order dependencies among attributes, it overlooks the neighbor feature, and its performance is poor if there are too many false ones among the candidates. For example, its extraction performance on author and source is very poor because the direction features between them and content are not strong enough.

6 Related Work

The problem studied in this paper belongs to the field of web data extraction, which has received a lot of attention in recent years. Survey [8] gives a good summary of these efforts. The research efforts in this field are either template-dependent [3,4,9,10,18] or template-independent [6,7,14,17]. In this section, we first give a brief introduction to them. Then, the works on news article extraction are introduced and compared.

Template-dependent works mainly focus on extracting structured data records and data items from web pages by inducing the common template. Most of them utilize the structure information in the DOM tree of a web page to represent the templates of similar web data records. In recent works, some visual features are also combined with the DOM tree to improve the performance, such as the method introduced by ViNTs [4]. However, the wrappers generated by those works can only be applied to web pages that share similar templates, and are not practical for the task of web news article extraction from general web sites. In addition, an annotation task is needed to assign the right semantics to the extracted data. Template-independent works aim to extract structured data from web pages with different templates. Most of these methods are based on probabilistic models, which integrate semantic information and human knowledge in inference. For example, Conditional Random Fields (CRF) [11] and its variations (such as 2D-CRF [6], HCRF [7] and Semi-Markov CRF [14]) infer the semantics of the text blocks in web pages by learning the order dependencies of web data distribution. Their distinct advantage is that they are insensitive to the templates of web pages. They focus on assigning semantic labels to the extracted data and can be seen as complementary to template-dependent works [6]. However, template-independent works are not suitable for noise-rich web pages (such as news pages) because too much noise significantly weakens the dependency of web data distribution.

News extraction is a special topic in the field of web data extraction. Until now, several works have been proposed for web news article extraction, but most of them focus only on content extraction. [2] proposes a top-down approach to generate a tree-structured wrapper. This approach is template-dependent, so it is not practical when news pages come from different web sites. [1] is a template-independent approach for content extraction based on independent visual features. But for most attributes, such as publication date and source, the performance is poor if only their independent features are considered (see the experiments in Section 5.3). Our approach differs from them in both application and technique. In application, our approach targets the extraction of multiple news attributes, not only content. In technique, our approach utilizes the layout relations of news attributes, which can improve the extraction performance on news attributes in a unified way.

7  Conclusions  and  Future  Work 

In this paper we propose a unified approach to extract multiple news attributes from news pages by using both the independent visual features and the dependent features. Extensive experiments show the effectiveness of our approach. We believe that exploiting the layout relations among attributes is a promising way to improve the extraction performance. In the future, we will try to perform the extraction task on other types of web objects, including blogs and detailed product objects in web pages.


Abstract. Mobile multimedia and the mobile Internet are both important development directions of mobile services. However, using the high data transmission rate of wireless multimedia communication services is costly. Without increasing the investment in hardware, personalized service can be applied to mobile services to reduce the cost of wireless multimedia communication while keeping the quality of service for users. This paper presents an Ontology-based method for modeling mobile phone user profiles. The model is built as a spatial graph and draws on the theory of interval-valued fuzzy sets. We put forward a series of related definitions and formulae for building the model and design an Algorithm on the Spatial Graph's Establishment and Updating. We also study reasoning based on the mobile phone user profile and present a Reasoning Algorithm on the Mobile Phone User Profile. This work is a useful attempt at building user profiles and forecasting users' possible requirements.

Keywords: mobile service, interval-valued fuzzy sets, user profile, spatial graph.


1  Introduction 

At present, using the high data transmission rate of wireless multimedia communication services is costly because such services need a large amount of scarce radio spectrum. Personalized service is now an important research direction for mobile services. Without increasing the investment in hardware, personalized service can be applied to mobile services to reduce the cost of wireless multimedia communication without reducing the quality of service for users. It collects users' personal information in mobile terminals, analyses and forecasts users' requirements, and then organizes multicast push when the transmission load of the mobile network is light, pushing appropriate information resources gained from the Internet and storing them in these mobile terminals. From the user's point of view, it can not only greatly reduce the cost of wireless transmission but also transparently offer high-speed wireless network resources and personalized service.

2 Related Works

If users' personalized requirements can be reasonably forecast, their satisfaction will improve; this is also the key to whether personalized service can be applied to mobile services successfully. The study of user profiles is the base and core of personalized service [1].

Some researchers have introduced the user model into mobile communication. Researchers at the University of Hannover build user profiles for mobile users and capture users' movement positions by using cell-ID and Wireless Signal Booster [2]. Giuseppe Araniti integrates the user profile with Quality-of-Service (QoS) and studies a soft QoS mechanism for wireless multimedia resources [3]. In the 3G network, researchers at the University of California build real-time user group profiles and reserve resources for user groups to provide better QoS to different classes of users [4]. Spyros Panagiotakis introduces information such as location into the mobile environment to better determine the user's environment [5]. G. Bartolomeo studies how to build user profiles for mobile terminals to customize and obtain services safely [6]. Researchers at the University of Toronto collect mobile phone users' preference information to provide customized advertisements for them [7]. For the mobile environment, researchers at Beijing University of Posts and Telecommunications study a user profile, based on a Markov decision process, with which the mobile terminal automatically chooses among the services offered by the mobile provider [8]. As part of a National Key Technological R&D Program, researchers at Zhejiang University propose a user's Smart Shadow model for pervasive computing environments, which adopts the Belief-Desire-Intention model to build the user profile [9].

These researchers mostly focus on how to acquire environmental information about users in the mobile Internet and on customized services. On the basis of our previous work [10], we discuss how to establish the mobile phone user profile and forecast the users' possible requirements.

3  The  Requirement  Issue 

Our system proactively pushes useful information to mobile terminal users according to the acquired Internet information. The two sides of the system are the mobile user terminal and the server side. The system works as follows: the mobile terminal regularly uploads the user's browsing information to the server side; the server side mines the user's interests and builds a miniature user profile for each user. At the same time, the information acquired from the Internet is temporarily stored in the corresponding information repositories, and then is drawn out in batches according to some strategies. The tree graph model integrates the individual requirements of the mobile terminal users with the recommended information resources on the Internet. Finally, users who have common interests are organized as a user group model, and the system pushes the corresponding information to the users according to information such as location. From the working principle described above, we can see that analyzing and accurately predicting the users' real requirements is the basis and key to the system's success or failure. So establishing the mobile phone user profile and fuzzily inferring users' requirements is an important research topic.

In this paper we study an Ontology-based method for modeling mobile phone user profiles. Reasoning is an important constituent of the study of user profiles. We believe that the user profile is only a tool and that it needs to be combined with reasoning technology to improve its effectiveness. So only a system that combines the two reasonably can predict the requirements of users.


4  User  Profile  Establishment 

The system builds a compact user profile for each user, which stores the user's interest profiles. It is not a general description of a user but a formal, algorithm-oriented description of the user with specific data structures. The user profile is composed of a spatial graph based on Ontology.

4.1  Expression  on  Node  and  Several  Formulae 

We present the definitions of space, subspace and node based on Ontology as follows.

Definition 1. Let G be a nonempty set. If some subsets of G are defined as open sets and they meet the following conditions:

1) G and ∅ are open sets;

2) The union of an arbitrary number of open sets is an open set;

3) The intersection of a finite number of open sets is an open set,

then these open sets are called a topological structure on G, and G is a topological space.

Definition 2. If G' ⊆ G, then G' is a topological subspace of G.

Definition 3. A four-tuple is used to express a node based on Ontology, ON = (Md, R, AT, IS). The meaning of each element is listed as follows.

Md: The meta information description of the node.

R: Relation. It is a two-tuple, (Rr, Rw). Rr is the set of relations on Ontology, which includes hypernym, hyponym, synonym and antonym, etc. Rw is the set of words that correspond to the relations in the supported Ontology libraries. The words can be extended.

AT: The attribute of the node.

IS: The instance of the node.

We should consider establishing and updating the nodes and edges of the spatial graph based on Ontology. First, the system clusters the user data uploaded from the mobile terminals. Then, the clustering result center is compared with the nodes of the spatial graph, and the comparison is mapped to the closed interval [0,1] based on interval-valued fuzzy sets. Because we seldom acquire an accurate value from clustering, the clustering result center is expressed as an interval-valued fuzzy set, A(x) = [A^-(x), A^+(x)], x ∈ G. In the same way, a node in this paper can also be expressed as B(x_i) = [B^-(x_i), B^+(x_i)], x_i ∈ G, where i denotes a certain node in the spatial graph.

Researchers point out that when a fuzzy domain is expressed with two interval-valued fuzzy sets, the corresponding topology relation degree is also an interval value [11]. So the problem of comparing the clustering result center with nodes can be solved by the intersection degree between the two fuzzy domains. The Formula on Interval-valued Fuzzy Set Mapping between Clustering Center and Nodes is as follows:

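P_{ci} = \left[ \bigvee_{x \in G} \left( A^-(x) \wedge B^-(x) \right),\ \bigvee_{x \in G} \left( A^+(x) \wedge B^+(x) \right) \right] \qquad (1)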


P_ci expresses the comparative value of the clustering result center and node i, P_ci ∈ [I], where [I] refers to the unit closed interval. If P_ci = [1,1], the two fuzzy domains must intersect. If P_ci = [0,0], the two fuzzy domains must be absolutely disjoint.
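
One common way to compute such an interval-valued intersection degree is the sup-min form sketched below. This is an assumption made for illustration, since the formula is not fully legible in the source: each fuzzy set is represented as a hypothetical dictionary from elements of G to (lower, upper) membership bounds, and elements missing from a set are given membership 0.

```python
def intersection_degree(a, b):
    """Interval-valued intersection degree of two interval-valued fuzzy sets.

    a, b: dict mapping an element to its (lower, upper) membership bounds.
    Returns (sup of min lower bounds, sup of min upper bounds).
    """
    zero = (0.0, 0.0)
    universe = set(a) | set(b)
    lower = max((min(a.get(x, zero)[0], b.get(x, zero)[0]) for x in universe),
                default=0.0)
    upper = max((min(a.get(x, zero)[1], b.get(x, zero)[1]) for x in universe),
                default=0.0)
    return lower, upper
```

Under this form a result of (1.0, 1.0) means some element fully belongs to both sets, and (0.0, 0.0) means no element has positive membership in both, which agrees with the two boundary cases described above.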

Here, if P_ci is bigger than a certain threshold, the information represented by the clustering result center belongs to the node and is added to it. If P_ci exceeds the threshold of more than one node, it is added to all of these nodes and their values are amended, too. If it does not exceed any threshold, the clustering center is added to the spatial graph as a new node. Then it should be judged whether to mount a new edge between the new node and each old node. The following is the Formula on New Edge Setting Judging:

5,  = 


-  •  arctan  1 7  ' 


Pf  =  p; 


2(Pf+ 


P;  ^Pr.Si  €/ 

P+  =  p, 


(2) 


To  suppose  Pci=[p,  The  St  should  be  related  to  the  two  factors:  0<  1 1  ■]/ i 
<1  and  0 <Pf  -  Pr<  1. 

Then a weight value is mounted on the new edge. Whether to mount a new edge is judged based on S_i gained from Formula (2), and the newly mounted edge is endowed with a weight value:

fi  G  (0,1)  (3) 

The  fi  relates  to  the  two  factors,  Sx  and  .  There  is  f,  ,  Sj  and  VF,  changing 
with  the  same  tendency.  The  l  is  a  coefficient,  Q<f<\\\<  I ,  d£Rs  we  will  carefully 
choose  t  to  keep  f/Wj  in  (0,1).  The  //  is  a  coefficient,  too.  //>(),  / i G /? .  We  will 
also  carefully  choose  p  and  (  to  keep  fj  in  (0,1). 

Next, the weight values of the old edges are updated. If more than one old node is
amended, the nodes are checked for edges among them. If such edges exist, their weights
are modified; if not, a new edge is created. If only one old node is amended, only its
edges are amended. If the computed value exceeds a certain threshold, a new edge is
created, and if it is less than another threshold, the edge is deleted. At the same time,
the time counter of the edge is amended too. This completes the expression and updating
of the spatial graph.

The formula for the time decay of an edge depends on two factors. One is the time
elapsed since the edge was established. The other is the prompting of the edge during
this time, that is, every use of the nodes reinforces the edge. The formula for the time
decay of an edge is as follows:


$$W = \begin{cases} \ldots, & t \neq 0 \\ 1, & t = 0 \end{cases} \qquad (4)$$


Suppose $t$ expresses the time elapsed since the edge was established; the value of $W$
should decrease as $t$ increases. Suppose $s$ expresses the prompting, $s \in N$; the
value of $W$ should increase as soon as the edge is stimulated. $W$ is the time-decay
weight of the edge and is a function of $t$ and $s$, with $W \in (0,1]$. $k$ is a
coefficient, $k > 1$.


4.2  Algorithm  on  the  Spatial  Graph’s  Establishment  and  Updating 

Algorithm 1. Algorithm on the Spatial Graph’s Establishment and Updating
Input: $D_t$: the useful data of the acquired user's behaviors
       $Gr$: a spatial graph
Output: the amended spatial graph

1: According to a fuzzy clustering algorithm, fuzzy-cluster $D_t$ to generate a
   clustering result center $D_c$.
2: for all $i$ : $i \in [1..n]$ do // There are n nodes in Gr, which are compared with
   $D_c$ one by one.
3:   Compare $D_c$ with $Gr_i$; then, according to formula (1), map the comparison
     result to $[I]$ based on the theory of interval-valued fuzzy sets.
     // $Gr_i$ denotes a certain node.
     // Compare $P_{ci}$, acquired from formula (1), with the threshold $\alpha$ to judge
     // whether to create a new node or amend the old nodes, as follows.
4:   if $P_{ci} > \alpha$ then // If it exceeds the threshold $\alpha$, the old nodes are amended.
5:     if only one node has $P_{ci} > \alpha$ then
6:       Add $D_c$ to $Gr_i$ and amend the value of $Gr_i$ // Exceeding the threshold
         $\alpha$, $D_c$ can be considered as belonging to $Gr_i$ and is added to the node.
7:       OldSideOne
8:     end if
9:     if more than one node has $P_{ci} > \alpha$ then
10:      Suppose $\exists\, i_1, i_2, \ldots$, with $i_1 \neq i_2 \neq \cdots$, which have the
         corresponding nodes $Gr_{i_1}, Gr_{i_2}, \ldots$
11:      Add $D_c$ into $Gr_{i_1}, Gr_{i_2}, \ldots$ one by one and amend the values of
         $Gr_{i_1}, Gr_{i_2}, \ldots$
12:      OldSideNotOne
13:    end if
14:  else // If it is less than the threshold $\alpha$, a new node is generated.
15:    Add $D_c$ into the spatial graph $Gr$ as a new node.
16:    Generate the node based on the data $D_t$ and the node definition.
17:    NewSide
18:  end if
19: end for

20: NewSide
21: for all $i$ : $i \in [1..n]$ do // Judge whether to establish new edges between the
    new node and each old node, based on formula (2), as follows.
22:   Compare $S_i$, computed from formula (2), with a threshold $\theta$ appointed in advance.
23:   if $S_i > \theta$ then
24:     Mount a new edge between the new node and the old node.
25:     Calculate the weight value of the new edge based on formula (3) and assign it
        to the new edge.
26:     Set the direction of the edge according to the relation between the two nodes.
27:   end if
28: end for

29: OldSideNotOne // Amend and update the old edges between these nodes when more than
    one old node is amended.
30: Check whether any edge exists among the amended old nodes $Gr_{i_1}, Gr_{i_2}, \ldots$
31: if any edge exists then
32:   Amend the weight of the old edge based on formula (4) with $s := s + 1$. Then
      calculate the amended weight value of the edge according to formula (3).
33: else
34:   Establish a new edge.
35:   Recalculate $P_{ci}$ for the amended old node using formula (1). Then calculate
      the weight of the new edge using formulae (2), (3) and (4).
36: end if

37: OldSideOne // Amend and update the old edges of the node when only one old node,
    $Gr_i$, is amended in $Gr$.
38: Amend the edges owned by the amended node $Gr_i$.
39: Use formula (4) with $s := s + 1$, then use formula (3) to get the weight value of
    the amended edge.
40: Return $Gr$
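
The branching of Algorithm 1 can be illustrated with a minimal Python sketch. It is one possible reading, not the authors' implementation: the interval value $P_{ci}$ is replaced by a hypothetical scalar score per node, the names `SpatialGraph` and `update_graph` are invented for the example, and the weight formulae (2) to (4) are reduced to incrementing the stimulation count $s$ of an edge.

```python
from dataclasses import dataclass, field


@dataclass
class SpatialGraph:
    # node id -> list of cluster centers absorbed by that node
    nodes: dict = field(default_factory=dict)
    # frozenset({u, v}) -> stimulation count s of the edge between u and v
    edges: dict = field(default_factory=dict)


def update_graph(graph, center, scores, alpha, new_id):
    """Dispatch one cluster center; scores maps node id -> scalar stand-in for P_ci."""
    matched = [node for node, p in scores.items() if p > alpha]
    if not matched:
        # No node exceeds alpha: add the center as a new node (steps 14-16).
        # Mounting its edges (NewSide) needs formula (2) and is left out here.
        graph.nodes[new_id] = [center]
        return "new_node", [new_id]
    for node in matched:
        graph.nodes[node].append(center)  # amend every matching old node
    if len(matched) == 1:
        # OldSideOne: stimulate the edges already owned by the amended node.
        for edge in graph.edges:
            if matched[0] in edge:
                graph.edges[edge] += 1
        return "one_old_node", matched
    # OldSideNotOne: stimulate existing edges among the amended nodes,
    # or create the edge (assumed to start at s = 1) when it does not exist.
    for k, u in enumerate(matched):
        for v in matched[k + 1:]:
            edge = frozenset((u, v))
            graph.edges[edge] = graph.edges.get(edge, 0) + 1
    return "several_old_nodes", matched


graph = SpatialGraph(nodes={"a": [], "b": []}, edges={frozenset(("a", "b")): 0})
print(update_graph(graph, "c1", {"a": 0.9, "b": 0.2}, alpha=0.5, new_id="n1"))
print(update_graph(graph, "c2", {"a": 0.9, "b": 0.8}, alpha=0.5, new_id="n2"))
print(update_graph(graph, "c3", {"a": 0.1, "b": 0.2}, alpha=0.5, new_id="n3"))
```

The three calls exercise the three branches in turn: one old node amended, several old nodes amended, and a new node created.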
4.3 Algorithm Analysis

An algorithm has five important characteristics: input data, output data, determinacy,
finiteness and effectiveness. A feasible algorithm should meet all five. Algorithm 1
presented in this paper has input data, namely $D_t$, the gained useful data on the
user's behaviors, and $Gr$, a spatial graph. It also has output data, namely $Gr'$, the
amended spatial graph. At present, the effectiveness of an algorithm cannot be proved
directly in theory [13]. Effectiveness means that each step of an algorithm can be
executed effectively, that is, the operations described in the algorithm can be realized
by executing a finite number of already-realized basic operations. If the analysis of
determinacy and complexity is based on each sentence and elementary operation, it
indirectly proves effectiveness. The following therefore analyzes Algorithm 1 in two
aspects: determinacy and finiteness, the latter mainly as time complexity.

Analysis on Algorithm Determinacy

Determinacy means that each step of an algorithm is unambiguous. An algorithm is
determinate if it meets the well-ordered principle [12], [13].

Theorem 1. If a well-ordered clause set $G$ can infer $X_1 \prec X_n$, that is,
$X_1 \to X_n$, then the deduction process can be represented as
$G \cup \{X_1, \sim X_n\}$, an unsatisfiable clause set.

Demonstration: See the proof of Theorem 3.35 in reference [13].

Theorem 2. Suppose $P$ is the beginning sentence of an algorithm and $Q$ is its end
statement. If the algorithm is determinate, then $P \to Q$ can be inferred from the
clause set $G$.

Demonstration: See the proof of Theorem 4.2.20 in reference [14].

Deduction 1. If an algorithm is determinate, then it can be represented as
$G \cup \{P, \sim Q\}$, an unsatisfiable clause set.

The following constructs a clause set $G$ for this algorithm and analyzes its
sentences:

1) Sentence 1 is the beginning of the algorithm and is expressed as $P$;

2) Sentences 2-19 form a loop structure. Here, sentence 3 is expressed as $A_1$, since
it executes in order before the sentences that follow. Sentences 4-18 form a nested
branching structure. It embeds IF only one node has $P_{ci} > \alpha$ (sentences 5-8),
where sentences 6 and 7 are expressed as $A_2$ and $A_3$ respectively, since they
execute in order; IF more than one node has $P_{ci} > \alpha$ (sentences 9-13), where
sentences 10, 11 and 12 are expressed as $A_4$, $A_5$ and $A_6$ respectively, since they
also execute in order; and ELSE $P_{ci} < \alpha$ (sentences 14-18), where sentences
15, 16 and 17 are expressed as $A_7$, $A_8$ and $A_9$ respectively, since they also
execute in order;

3) Sentences 20-28 are the invoked NewSide. Sentence 20 is the entry of the function and
is expressed as $A_{10}$. Sentences 21-28 form a loop structure. Here, sentence 22 is
expressed as $A_{11}$, since it executes in order before the sentences that follow, and
sentences 23-27 form an IF sentence expressed as $A_{12}$;

4) Sentences 29-36 are the invoked OldSideNotOne. Sentence 29 is the entry of the
function and is expressed as $A_{13}$. Sentence 30 is expressed as $A_{14}$, since it
executes in order before the sentences that follow. Sentences 31-36 form a branching
structure: IF (sentences 31-32) is expressed as $A_{15}$ and ELSE (sentences 33-36) is
expressed as $A_{16}$;

5) Sentences 37-39 are the invoked OldSideOne. Sentence 37 is the entry of the function
and is expressed as $A_{17}$. Sentences 38 and 39 are expressed as $A_{18}$ and $A_{19}$
respectively, since they execute in order;

6) Sentence 40 is the end of the algorithm and is expressed as $Q$.

From the analysis of the algorithm's sentences above, we can prove its determinacy.

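Demonstration: The clause set of Algorithm 1 is

$$
\begin{aligned}
G = \{\, & (P \to A_1),\ \bigvee_{i=2,4,7}(A_1 \to A_i),\ (A_2 \to A_3),\ (A_4 \to A_5) \wedge (A_5 \to A_6),\\
& (A_7 \to A_8) \wedge (A_8 \to A_9),\ A_3 \to A_{17},\ A_6 \to A_{13},\ A_9 \to A_{10},\\
& (A_{10} \to A_{11}) \wedge (A_{11} \to A_{12}),\ A_{13} \to A_{14},\ \bigvee_{i=15,16}(A_{14} \to A_i),\\
& (A_{17} \to A_{18}) \wedge (A_{18} \to A_{19}),\ \bigvee_{i=12,15,16,19}(A_i \to Q) \,\}\\
= \{\, & \sim P \vee A_1,\ \sim A_1 \vee A_2 \vee A_4 \vee A_7,\ \sim A_2 \vee A_3,\ \sim A_4 \vee A_5,\ \sim A_5 \vee A_6,\\
& \sim A_7 \vee A_8,\ \sim A_8 \vee A_9,\ \sim A_3 \vee A_{17},\ \sim A_6 \vee A_{13},\ \sim A_9 \vee A_{10},\\
& \sim A_{10} \vee A_{11},\ \sim A_{11} \vee A_{12},\ \sim A_{13} \vee A_{14},\ \sim A_{14} \vee A_{15} \vee A_{16},\\
& \sim A_{17} \vee A_{18},\ \sim A_{18} \vee A_{19},\ \sim A_{12} \vee Q,\ \sim A_{15} \vee Q,\ \sim A_{16} \vee Q,\ \sim A_{19} \vee Q \,\}
\end{aligned}
$$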

It follows from the well-ordered definition and Theorem 1 that $P \to Q$ holds in the
clause set $G$, that is, the deduction process can be represented as
$G \cup \{P, \sim Q\}$, an unsatisfiable clause set. By Theorem 2 and Deduction 1, the
determinacy of the algorithm is proved.
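
The unsatisfiability claim can be checked mechanically. The sketch below is an illustration under stated assumptions, not part of the original proof: the implications are our reading of the sentence labelling in items 1) to 6) above ($P$, $A_1$ to $A_{19}$, $Q$), and `implies` and `satisfiable` are hypothetical helper names implementing a small DPLL-style search.

```python
def implies(premise, *conclusions):
    """Clause for premise -> (c1 or c2 or ...), as a list of (name, polarity) literals."""
    return [(premise, False)] + [(c, True) for c in conclusions]


def satisfiable(clauses):
    """Tiny DPLL-style search: True if some truth assignment satisfies every clause."""
    if not clauses:
        return True
    if any(len(c) == 0 for c in clauses):
        return False
    # Prefer a unit clause (forced value); otherwise branch on the first literal.
    unit = next((c for c in clauses if len(c) == 1), None)
    name, polarity = (unit or clauses[0])[0]
    choices = [polarity] if unit else [polarity, not polarity]
    for value in choices:
        reduced = []
        for clause in clauses:
            if (name, value) in clause:
                continue  # clause is satisfied by this choice
            reduced.append([lit for lit in clause if lit[0] != name])
        if satisfiable(reduced):
            return True
    return False


# Assumed reading of the clause set G of Algorithm 1.
G = [
    implies("P", "A1"),
    implies("A1", "A2", "A4", "A7"),
    implies("A2", "A3"), implies("A3", "A17"),
    implies("A4", "A5"), implies("A5", "A6"), implies("A6", "A13"),
    implies("A7", "A8"), implies("A8", "A9"), implies("A9", "A10"),
    implies("A10", "A11"), implies("A11", "A12"),
    implies("A13", "A14"), implies("A14", "A15", "A16"),
    implies("A17", "A18"), implies("A18", "A19"),
    implies("A12", "Q"), implies("A15", "Q"), implies("A16", "Q"), implies("A19", "Q"),
]
refutation = G + [[("P", True)], [("Q", False)]]  # G together with P and ~Q
print(satisfiable(G), satisfiable(refutation))
```

Under this reading, $G$ alone has a model while $G \cup \{P, \sim Q\}$ has none, which is the shape of the argument in Deduction 1.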

Analysis on Algorithm Time Complexity

Suppose $L_0 = \max(|D_j|)$, $j = 1, 2, \ldots, r$, where $D_j$ expresses the time of
one execution of the process that fuzzy-clusters $D_t$ to generate a clustering result
center $D_c$ according to a fuzzy clustering algorithm. Suppose $L_1$ is the time to
amend an old node and $L_2$ is the time to establish a new node. Suppose $L_3$ is the
time to mount a new edge for a new node, which includes mounting the edge and obtaining
its weight and direction. Suppose $L_4$ is the time to amend and update the old edges
owned by the amended old nodes, and $L_5$ is the time to mount a new edge between two
amended old nodes.

Analyzing this algorithm, we find two steps in its implementation: the first is the
pretreatment on the spatial graph, which is $D_j$, and the second is the treatment of
the spatial graph, which deals with both nodes and edges, that is, $L_1$, $L_2$, $L_3$,
$L_4$ and $L_5$. For a given $D_j$, these five operations do not all occur at the same
time; only two or three of them may. Hence, the complexity calculated under the
assumption that all five cases happen together is necessarily greater than the actual
complexity of the algorithm.

① The case in which only one old node in $Gr$ is amended.

There are $n$ nodes in $Gr$. This case includes the maximum time for the pretreatment on
the spatial graph, for selecting and amending one old node in $Gr$, and for amending all
existing old edges between this old node and all other old nodes in $Gr$, that is,
$L_0 + L_1 + (n-1)L_4$;

② The case in which more than one old node in $Gr$ is amended.

This case includes the maximum time for the pretreatment on the spatial graph, for
selecting and amending more than one old node in $Gr$, and for amending all existing old
edges and mounting new edges between every amended old node and all other old nodes in
$Gr$. We suppose that each amended old node both amends an old edge and mounts a new
edge once with every other old node. In fact, the two operations are mutually exclusive,
so we can take the average of both, that is,
$L_0 + n\left[L_1 + \tfrac{n-1}{2}(L_4 + L_5)\right]$;

③ The case of establishing a new node in $Gr$.

This case includes the maximum time for the pretreatment on the spatial graph, for
establishing a new node, and for potentially mounting new edges between this new node
and all old nodes in $Gr$, that is, $L_0 + L_2 + nL_3$;

④ The time to complete the basic operations of the algorithm in a whole execution,
namely in one process of establishing and updating the user profile, is less than

$$
\begin{aligned}
& L_0 + L_1 + (n-1)L_4 + L_0 + n\left[L_1 + \tfrac{n-1}{2}(L_4 + L_5)\right] + L_0 + L_2 + nL_3\\
&= 3L_0 + (n+1)L_1 + L_2 + nL_3 + \tfrac{1}{2}(n^2 + n - 2)L_4 + \tfrac{1}{2}(n^2 - n)L_5
\end{aligned}
$$

Here, $L_i$ ($i = 1, 2, 3, 4, 5$) is the time to accomplish a certain basic operation,
which can be regarded as a constant. $L_0 = \max(|D_j|)$, $j = 1, 2, \ldots, r$, is the
upper limit of the pretreatment on the spatial graph; that $D_j$ attains the value $L_0$
is only a low-probability event in practice, so $L_0$ can be regarded as a constant too.
Therefore, the expression above is of the same order as $n^2$. Denoting it $T(n)$, we
have $T(n) = O(n^2)$. Because $T(n)$ is obtained when $L_i$ ($i = 1, 2, 3, 4, 5$) all
occur at the same time, which surely exceeds the actual situation, the time complexity
of the algorithm can be expressed as $T(n) = o(n^2)$.
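
As a sanity check on the arithmetic, the following short sketch sums the three case costs and compares the result with the closed form $3L_0 + (n+1)L_1 + L_2 + nL_3 + \tfrac{1}{2}(n^2+n-2)L_4 + \tfrac{1}{2}(n^2-n)L_5$. The function names and the constant values chosen for $L_0, \ldots, L_5$ are arbitrary assumptions made for the example.

```python
from fractions import Fraction


def case_costs(n, L):
    """Upper bounds for the three cases: one old node, several old nodes, new node."""
    L0, L1, L2, L3, L4, L5 = L
    one_old = L0 + L1 + (n - 1) * L4
    several_old = L0 + n * (L1 + Fraction(n - 1, 2) * (L4 + L5))
    new_node = L0 + L2 + n * L3
    return one_old, several_old, new_node


def closed_form(n, L):
    """Closed form of the summed bound T(n)."""
    L0, L1, L2, L3, L4, L5 = L
    return (3 * L0 + (n + 1) * L1 + L2 + n * L3
            + Fraction(n * n + n - 2, 2) * L4
            + Fraction(n * n - n, 2) * L5)


L = (7, 3, 5, 2, 11, 13)  # arbitrary constants standing in for L0..L5
for n in (1, 2, 10, 1000):
    assert sum(case_costs(n, L)) == closed_form(n, L)
    # The ratio T(n) / n^2 approaches (L4 + L5) / 2 as n grows.
    print(n, float(closed_form(n, L) / (n * n)))
```

The exact rational arithmetic avoids rounding, so the equality check holds for every $n$ tried; the printed ratio illustrates the quadratic order of the bound.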

The above analyzes and proves Algorithm 1 from a theoretical point of view. The
algorithm is thus shown to have the traits of determinacy, effectiveness and low time
complexity.

5 User’s Requirements Reasoning

After establishing a user profile, the system needs to analyze and forecast the user's
actual possible requirements and periodically upload them to the system, based on the
nodes which have changed in a cycle. The reasoning background is acquired from the
analysis and summary of the user's online behaviors, which is stored in the user
profile. That is to say, the concrete manifestation of the background is the
Ontology-based spatial graph, which is the main basis when the system reasons about the
user's requirements.

5.1 Reasoning Mechanism

Suppose the changed nodes in a spatial graph are $Q$. We then reason from $Q$ to acquire
possible results according to the user's actual requirements, expressed as $Q'$. $f$
expresses the reasoning process, $Q$ is regarded as the reasoning antecedent and
premise, and $Q'$ is regarded as the reasoning consequent and conclusion. We suppose
$A_i^l$ is a link of the reasoning process, that is, a node of a certain selected
reasoning path. Here, $l \in [1, 2, \ldots, k]$ expresses the index of a certain
reasoning path, $k$ expresses the total number of all possible paths in the reasoning
process, $i \in [1, 2, \ldots, n]$ expresses the index of a certain node in a certain
reasoning path, and $n$ expresses the number of nodes in this path. The reasoning
process can then be expressed as follows:

$$f : Q \Rightarrow Q'$$

$$Q \mapsto \bigvee_{l=1}^{k}\left(\prod_{i=1}^{n} A_i^l\right) \mapsto Q'$$

Among them, $\mapsto$ expresses the derivation symbol, $\bigvee$ expresses the selection
symbol, which refers to choosing a certain reasoning path, and $\prod$ expresses the
ordered path of the nodes in a certain reasoning path. It can be seen that the reasoned
$Q'$ may not be unique for a given $Q$, since it depends on the choice of
$\prod_{i=1}^{n} A_i^l$.

5.2  Reasoning  Algorithm  on  the  Mobile  Phone  User  Profile 

Algorithm 2. Reasoning Algorithm on the Mobile Phone User Profile
Input: $Gr$: a spatial graph
Output: $I_t$: the information intersection of all nodes in a certain selected path

1: for all $i$ : $i \in [1..n]$ do // There are n changed nodes in Gr.
2:   if $N_i$ is a new node then
3:     FindRoute
4:   else
5:     if $P_{ci} > \lambda$ then // If the value $P_{ci}$ of the node $N_i$ is bigger than a
       threshold value $\lambda$ appointed in advance, then $N_i$ is a destination node.
6:       FindRoute
7:     end if
8:   end if
9: end for

10: FindRoute // Find the optimal path as follows.
11: Starting from the node $N_i$, find all of its upper nodes $N_j^{(l)}$,
    $j \in [1..m]$, $l \in [1..t]$, and the interconnected directed edges among these
    nodes, $S_k^{(l)}$, $k \in [1..r]$.
    // $N_i$ has m upper nodes. There are r directed edges connecting $N_i$ and these
    // upper nodes. The index $l$ layers these upper nodes and directed edges according
    // to their distances from $N_i$; there are t layers.
12: begin
13:   for all $l$ : $l \in [1..t]$ do
14:     Find $\max(S_k^{(l)})$ // Find the directed edge with the largest weight within
        the same layer.
15:     Note down the corresponding node $N_j^{(l)'}$ // Note down the corresponding
        upper node of the edge.
16:     Continue searching the next layer from the upper nodes of $N_j^{(l)'}$
17:   end for
18:   Record $\Pi_t = \{N_j^{(l)'}\}$ and obtain the optimal path.
19:   Draw out the user's information $I_j^{(l)'}$ contained in $N_j^{(l)'}$, the nodes
      of the path.
20:   $I_t = \bigcap I_j^{(l)'}$, $l \in [1..t]$ // Draw out the user's information
      contained in the nodes of the path and acquire the intersection.
21: end
22: Return $I_t$
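
The FindRoute step can be sketched as a greedy walk. This is a minimal illustration under assumptions, not the authors' code: the graph is given as a hypothetical mapping `upper_edges` from a node to its upper nodes with edge weights, each node's user information is a set of items in `info`, and the number of layers $t$ is passed as `max_layers`.

```python
def find_route(start, upper_edges, info, max_layers):
    """Greedy FindRoute sketch.

    upper_edges: node -> {upper node: weight of the directed edge to it}
    info:        node -> set of information items stored in that node
    Returns the selected path and the intersection of the information on it.
    """
    path = []
    current = start
    for _ in range(max_layers):
        candidates = upper_edges.get(current, {})
        if not candidates:
            break  # no upper node left to follow
        # Step 14-15: take the upper node reached by the heaviest edge.
        current = max(candidates, key=candidates.get)
        path.append(current)
    if not path:
        return path, set()
    # Step 19-20: intersect the information contained in the nodes of the path.
    common = set.intersection(*(info[node] for node in path))
    return path, common


upper_edges = {
    "n": {"a": 0.4, "b": 0.7},
    "b": {"c": 0.9, "d": 0.2},
}
info = {
    "a": {"games"},
    "b": {"music", "news"},
    "c": {"news", "sport"},
    "d": {"music"},
}
print(find_route("n", upper_edges, info, max_layers=3))
```

In this toy graph the walk follows the heaviest edge at each layer, selecting `b` and then `c`, and the only item shared by both nodes is `news`. Bounding the walk by `max_layers` also keeps it finite if the graph happens to contain a cycle.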

5.3 Algorithm Analysis

In Algorithm 2, the input data is $Gr$, a spatial graph. The output data is $I_t$, the
information intersection of all nodes in the selected path.

Analysis on Algorithm Determinacy

The determinacy proof of Algorithm 2 is the same as that of Algorithm 1; we omit the
details as space is limited.

Analysis on Algorithm Time Complexity

For the same reason, we omit the detailed time complexity analysis of Algorithm 2. The
result is a polynomial expression, $T(n) = O(n(L_0 + L_1 + L_2 + L_3))$.

6 Conclusion

This paper focuses on how to establish a mobile phone user profile and forecast users’
possible requirements. We present a modeling method for mobile phone user profiles
based on Ontology and introduce the theory of interval-valued fuzzy sets. The proposed
method brings forward a series of related definitions and formulae for establishing the
model and designs an Algorithm on the Spatial Graph's Establishment and Updating. We
also study the reasoning technology based on the mobile phone user profile and present
a Reasoning Algorithm on the Mobile Phone User Profile. In future work, we will study
in more depth aspects such as the dynamic user group model and the measurement of
prediction accuracy for the user group model.
Abstract. In this paper, we study a novel problem which we refer to as News
Website Evaluation (NWE). Given a collection of news articles, NWE is primarily
concerned with evaluating the importance of their websites with respect to specific
news topics. This general problem subsumes many interesting applications including
news tracking and website ranking. To solve this problem, we first propose a
Topic-oriented Website Evaluation Model (TWEM) which exploits various forms of
information and combines them in a unified computation framework. Then, considering
the special characteristics of news articles, we incorporate an article merging
operation into TWEM and present the merge-TWEM model. The experimental results show
that the proposed models perform significantly better than competitive baseline
systems, and can serve as effective solutions to the News Website Evaluation problem.

Keywords:  News  Website  Evaluation,  Website  Ranking,  News  Articles,  Web 
Mining. 


1  Introduction 

As online news pages are accumulating to an intractably huge size, how to retrieve
desired information has become an increasingly important issue. Under such
circumstances, news search is attracting intensive attention from the research community
and commercial organizations. Some web services such as Google News are able to provide
users with satisfactory ranking results for news pages. However, in some cases, we are
also interested in the ranking of websites on specific news topics. For instance, we
have no idea whether CNN is more authoritative than CBS News in reporting the
“Copenhagen Conference”, although from Google News we can find the important news pages
on this topic.

Page ranking has been a traditional focus of information retrieval, and a number of
methods have been proposed for this task. However, there are as yet no mechanisms with
which we can rank websites according to specific news topics. In this paper, we define
and study a novel problem which is referred to as News Website Evaluation (NWE). Given a
collection of news articles¹, the task of NWE is to evaluate the importance of the
websites that these articles belong to. This problem potentially has many application
scenarios. A typical example involves tracking analysis of news reports on the web. To
recognize the propagation pattern of news, we are more interested in the spreading
process among different websites than among pages. In this situation, a key factor which
has to be considered is the relative importance of websites on this news topic.

¹ News article and news page are equivalent concepts in this study.

As a solution to this problem, we first propose a Topic-oriented Website Evaluation
Model (TWEM). To achieve desirable performance, TWEM takes advantage of various forms of
information. Specifically, TWEM considers the interdependency between websites and news
articles, as well as the mutual support among news articles. In addition, the inherent
popularity of websites is considered when we infer their final importance scores. Then,
we adapt the TWEM model to the special features of news articles by introducing an
article merging operation. Article merging aims to merge similar news articles into
super-articles. We propose another model, named merge-TWEM, to combine TWEM and article
merging. We conduct extensive experimental studies to test the proposed models.
Experimental results on a real dataset show that both TWEM and merge-TWEM outperform the
baseline systems to a great extent. Moreover, performance comparison reveals that
merge-TWEM achieves better results than TWEM, which demonstrates that the article
merging operation indeed boosts the performance of TWEM.

The rest of the paper is organized as follows. Section 2 reviews previous work related
to this study. Section 3 presents the proposed models. In Section 4, we give and discuss
the experimental results. Section 5 concludes and outlines future work.

2  Related  Work 

2.1  Page  Ranking 

Page ranking is a well studied problem in information retrieval and web mining.
PageRank [7] and HITS [1] are two well known models for ranking web pages based on link
analysis. Besides link structures, topical information has been exploited for designing
more sophisticated ranking models. Haveliwala et al. [8] proposed the Topic-sensitive
PageRank model to combine topical analysis with PageRank. In this model, some topics are
selected from predefined categories and a biased PageRank vector is computed for each
topic. The final score for each page is obtained by summing elements of all the PageRank
vectors pertaining to different topics. Chakrabarti et al. [10] exploited the anchor
texts of hyperlinks to assign each hyperlink a topical weight. This topical weight is
employed in the computation process of HITS. Bharat et al. [9] gave each node in HITS a
relevance weight which is defined as the similarity of this node's document to the topic
query. This relevance weight is used to regulate hub and authority scores computed by
HITS. Nie et al. [11] proposed the Topical HITS and Topical PageRank models in which
topical information is incorporated into HITS and PageRank in a probabilistic way. For
each page, they calculated a score vector to distinguish the contributions from
different topics. Their models outperform other approaches and keep the characteristics
of the basic HITS and PageRank unchanged.
Other forms of information are also explored for web ranking. Wang et al. [12] proposed
to use media focus and user attention information to rank news topics within a certain
news story. Fernandes et al. [13] proposed to use block information in web pages to get
better ranking results. Dou et al. [14] introduced methods which incorporate anchor
texts into web search and achieve better retrieval performance. Guo et al. [15] built a
Bayesian click chain model (CCM) for mining web search click logs, which is helpful for
improving the results of web search.

2.2  Evaluation  of  Websites 

Another line of related work focuses on evaluating websites. Inspired by the idea of
HITS, Yin et al. [2, 3] proposed the TRUTHFINDER model for identifying trustworthy
websites and correct facts on the web. Based on the interdependency between websites and
facts, an iterative computation method is used to calculate the trustworthiness of
websites and the correctness of facts. Dai et al. [6] proposed a trust model to
determine the trustworthiness of data providers (websites). They made use of various
features, such as data similarity, data conflict, path similarity and data deduction,
for calculating the trust scores of websites. Although these models have been shown to
be effective in deciding whether a website is trustworthy on a subject, they are unable
to provide information about whether a website is important on a specific topic, e.g.,
a news story.

Liu et al. [4] proposed the BrowseRank model to exploit user browsing behavior data for
ranking web pages. When the transitions between pages in the same website are ignored,
BrowseRank can give ranking results for websites. Zhu et al. [5] introduced the
ClickRank model for estimating web page and website importance from browsing
information. In their model, the score for a website is the sum of the ClickRank values
of its web pages. Gao et al. [16] designed a model to compute the weights of websites
and web pages at the same time. However, they ignored critical factors such as website
popularity and web page merging.

3  The  Proposed  Models 

In this section, we present two models, i.e., TWEM and merge-TWEM, for the NWE problem.
Before going into the details, we first give some formal definitions. For a news topic
$t$, we denote the set of news articles as $A = \{a_i\}$, whose websites comprise the
website set $W = \{w_j\}$. Since several articles can belong to the same website, we
have $|A| \geq |W|$.

We define the topical importance of $w_j$, denoted as $imp(w_j)$, as a value between 0
and 1 which indicates its relative importance on the topic $t$. The higher this value,
the more important $w_j$ is. The task of the NWE problem can then be formulated as
inferring the topical importance of each website in $W$, and ranking these websites
according to their topical importance values.
3.1 TWEM

The TWEM model mainly consists of two steps: dynamic computation and popularity
incorporation. In this subsection, we describe them in detail.

Dynamic computation. For the news topic $t$, there are usually a large number of related
articles. It is clear that several news articles can belong to the same website. We also
assume in this study that a single article can belong to multiple websites. To
illustrate this point, we give an example in Figure 1. The news article $a_1$ is on
website $w_1$, and the article $a_2$ (on another website) has issued important
information on this topic. Besides its own contents, $a_1$ probably gives a hyperlink to
$a_2$, which is a common case in news articles and websites. In this situation, the
article $a_2$ can be viewed as contained in $w_1$ as well, in the sense that $a_2$ can
be accessed from $w_1$. Based on this assumption, we can conclude that there is actually
a “many-to-many” mapping between websites and articles. An interdependency relationship
exists between websites and news articles: an article (like $a_1$) is considered to be
important if it belongs to many important websites; a website (like $w_1$) is considered
to be important if it contains many important articles. Then, an iterative computation
method like HITS [1] can be used to calculate the scores of the websites. The websites
and articles are organized into a bipartite graph, where the hub nodes are websites and
the authority nodes are news articles. There is a link from website $w_j$ to article
$a_i$ if $a_i$ belongs to $w_j$.

Besides their relations with websites, news articles also influence each other.
Intuitively, if the contents of one article are similar to those of many others, this
article can be considered to be supported by the others, and thus should be assigned a
higher importance score. The support from article $a_i$ to article $a_j$, denoted as
$sup(i,j)$, is defined as the amount of importance that should be added to $a_j$ if we
know $a_i$ is important. Then, the score of an article comes from two sources:
websites as hubs and support from other articles. This idea is captured by the model
presented in Figure 2. The authority scores for the articles and the hub scores for the
websites are calculated iteratively as follows.

Fig. 1. An example of the “many-to-many” relationship between websites and news
articles

$$Auth^{n+1}(a_i) = \frac{\sum_{w_j \ni a_i} Hub^{n}(w_j)}{r} + \sum_{a_k \in A} sup(k,i) \cdot Auth^{n}(a_k)$$

$$Hub^{n+1}(w_j) = Hub^{n}(w_j) + \frac{\sum_{a_i \in w_j} Auth^{n}(a_i)}{v} \qquad (1)$$

where $Auth^{n}(a_i)$ is the authority score for article $a_i$ in the $n$-th iteration,
$Hub^{n}(w_j)$ is the hub score for website $w_j$ in the $n$-th iteration, $r$ is the
number of websites containing $a_i$, and $v$ is the number of articles belonging to
$w_j$. $sup(k,i)$ represents the support from $a_k$ to $a_i$, and is defined by the
following formula.


$$sup(k,i) = \frac{sim(a_k, a_i)}{\sum_{a_j \in A} sim(a_k, a_j)} \qquad (2)$$


where $sim(a_k, a_i)$ is the text similarity between article $a_k$ and article $a_i$.
From a stochastic perspective, $sup(k,i)$ is actually the probability of transiting from
$a_k$ to $a_i$ on the graph. To avoid self-transition, we define $sup(i,i) = 0$. After
each iteration, we transform the hub and authority weights using the function
$f(x) = 1 - \exp(-x)$ in order to smooth the values and to keep the weights between 0
and 1.

From the above computation, we can see that the authority and hub scores are updated
repeatedly until the iterative process converges, which is why we call this step
dynamic computation. Convergence is achieved when the difference between the
scores computed at two successive iterations falls below a given threshold.
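
The dynamic computation can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `support_matrix`, `smooth` and `twem_dynamic`, the initial scores of 1.0, the tolerance, and the iteration cap are all hypothetical, and the self-similarity term is left out of the denominator of the support so that each row sums to one.

```python
import math

def support_matrix(sim):
    """sup[k][i] = sim[k][i] / sum_j sim[k][j], with sup[i][i] = 0."""
    n = len(sim)
    sup = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Assumption: the self term is excluded so each row sums to 1.
        total = sum(sim[k][j] for j in range(n) if j != k)
        if total > 0:
            for i in range(n):
                if i != k:
                    sup[k][i] = sim[k][i] / total
    return sup

def smooth(x):
    """Smoothing function f(x) = 1 - exp(-x), keeping scores in (0, 1)."""
    return 1.0 - math.exp(-x)

def twem_dynamic(contains, sim, tol=1e-6, max_iter=200):
    """contains[j] is the set of article indices that belong to website j."""
    n_sites, n_articles = len(contains), len(sim)
    sup = support_matrix(sim)
    sites_of = [[j for j in range(n_sites) if i in contains[j]]
                for i in range(n_articles)]
    auth = [1.0] * n_articles   # assumed initial authority scores
    hub = [1.0] * n_sites       # assumed initial hub scores
    for _ in range(max_iter):
        new_auth = []
        for i in range(n_articles):
            r = len(sites_of[i])
            from_hubs = sum(hub[j] for j in sites_of[i]) / r if r else 0.0
            from_support = sum(sup[k][i] * auth[k] for k in range(n_articles))
            new_auth.append(smooth(from_hubs + from_support))
        new_hub = []
        for j in range(n_sites):
            v = len(contains[j])
            gain = sum(auth[i] for i in contains[j]) / v if v else 0.0
            new_hub.append(smooth(hub[j] + gain))
        # Stop when no score moved by more than the threshold.
        delta = max(abs(a - b) for a, b in zip(new_auth + new_hub, auth + hub))
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub
```

Calling `twem_dynamic([{0, 1}, {1, 2}], sim)` with a 3 x 3 similarity matrix would treat article 1 as belonging to both websites, which mirrors the many-to-many assumption made above.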



Popularity incorporation. In addition to the hub scores derived from their articles,
websites themselves have a popularity that is independent of specific topics.
Popularity is the extent to which a website is popular among the public; e.g., Yahoo
is a very popular portal website. Intuitively, if a website is quite popular, users are
more likely to issue important information on it in order to attract the attention of
others. The accumulation of important articles in turn makes this website important on
certain news topics. Therefore, the popularity of websites also has an influence on
their topical importance. A possible way to measure popularity quantitatively is to
utilize the website ranks from Alexa (www.alexa.com). Alexa provides detailed ranks of
websites based on page views and traffic. If the Alexa rank of the website $w_j$ is
$aRank(w_j)$, its popularity is defined as


$$Pop(w_j) = 1.0 - \frac{aRank(w_j)}{\max\big(aRank(w_1), aRank(w_2), \ldots, aRank(w_N)\big)} \qquad (3)$$

where $N$ is the total number of websites in the website collection $W$.


Final topical importance. We combine the hub score and the popularity to obtain the
topical importance for website $w_j$, that is,

$$imp(w_j) = \alpha \cdot Pop(w_j) + (1 - \alpha) \cdot Hub(w_j) \qquad (4)$$

where $\alpha$ is a factor that controls the balance between the popularity and the hub
score.
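
The popularity and the final topical importance can be illustrated with a short Python sketch. The function names and the example domain names are hypothetical, and the default of 0.3 for the balance factor is the value the experiments in Section 4.2 settle on.

```python
def popularity(alexa_ranks):
    """Pop(w) = 1.0 - aRank(w) / max rank over the website collection."""
    worst = max(alexa_ranks.values())
    return {site: 1.0 - rank / worst for site, rank in alexa_ranks.items()}

def topical_importance(hub, pop, alpha=0.3):
    """imp(w) = alpha * Pop(w) + (1 - alpha) * Hub(w)."""
    return {site: alpha * pop[site] + (1.0 - alpha) * hub[site] for site in hub}

# Hypothetical ranks and hub scores for two made-up websites.
ranks = {"site-a.example": 1, "site-b.example": 100}
hubs = {"site-a.example": 0.5, "site-b.example": 0.8}
importance = topical_importance(hubs, popularity(ranks))
ranking = sorted(importance, key=importance.get, reverse=True)
```

Under this definition the worst-ranked website in the collection always receives a popularity of exactly 0, so its topical importance depends on its hub score alone.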

3.2 merge-TWEM

News  articles  commonly  cite  contents  from  each  other,  especially  from  authoritative 
sources,  e.g.,  some  news  agencies.  An  example  is  presented  in  Figure  3  where  the  two 
news  articles  have  identical  contents  which  are  originally  released  by  the  Associated 
Press. Such citation makes news articles on the same topic usually show great
similarity. Because of the support among articles in dynamic computation, a group of
very similar articles will promote each other and obtain unfairly high scores.

Fig. 3. An example of citation among news articles


To address this problem, we propose a merge-TWEM model which extends TWEM to
include an article merging operation. In particular, if the similarity between two
articles exceeds a predefined threshold, we merge them into a single super-article. A
super-article is an article group in which every two members have similarity over the
threshold. After this merging is conducted, we get the set of super-articles
$SA = \{sa_i\}$, where each $sa_i$ is associated with a group of members (articles),
i.e.,

$$sa_i = \{a_{i1}, a_{i2}, \ldots, a_{i|sa_i|}\} \qquad (5)$$

where $a_{ij}$ is the $j$-th member in super-article $sa_i$, and $|sa_i|$ is the total
number of members contained by $sa_i$. A super-article belongs to all the websites of
its members. Figure 4 gives a description of the merge-TWEM model.
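
The text does not say how the groups are formed, so the following Python sketch uses a simple greedy pass as an assumption: an article joins the first existing group in which its similarity to every member exceeds the threshold, and otherwise starts a new group. The function names are hypothetical, and the threshold default of 0.9 is the value used in Section 4.2.

```python
def merge_articles(sim, threshold=0.9):
    """Greedily group article indices so that every pair within a group
    has similarity above the threshold (an assumed merging procedure)."""
    groups = []
    for i in range(len(sim)):
        for group in groups:
            if all(sim[i][m] > threshold for m in group):
                group.append(i)
                break
        else:
            groups.append([i])  # no compatible group: start a new super-article
    return groups

def super_article_sites(groups, article_sites):
    """A super-article belongs to all the websites of its members."""
    return [set().union(*(article_sites[m] for m in group)) for group in groups]

# Articles 0 and 1 are near-duplicates; article 2 is distinct.
sim = [[1.0, 0.95, 0.2],
       [0.95, 1.0, 0.3],
       [0.2, 0.3, 1.0]]
groups = merge_articles(sim)
sites = super_article_sites(groups, [{"w1"}, {"w2"}, {"w2"}])
```

A greedy pass is order-dependent, so a different article order may produce different groups; it is used here only because it guarantees the pairwise condition stated above.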

Then  the  iterative  computation  process  for  authority  scores  of  super-articles  and 
hub  scores  of  websites  can  be  formulated  as  follows. 

$$Auth^{n+1}(sa_i) = \frac{\sum_{w_j \ni sa_i} Hub^{n}(w_j)}{r} + \sum_{sa_k \in SA} msup(k,i) \cdot Auth^{n}(sa_k)$$

$$Hub^{n+1}(w_j) = Hub^{n}(w_j) + \frac{\sum_{sa_i \in w_j} Auth^{n}(sa_i)}{v} \qquad (6)$$

where $Auth^{n}(sa_i)$ is the authority score for super-article $sa_i$ in the $n$-th
iteration, $Hub^{n}(w_j)$ is the hub score for website $w_j$ in the $n$-th iteration,
$r$ is the number of websites that $sa_i$ belongs to, and $v$ is the number of
super-articles that $w_j$ contains. We define the similarity between two
super-articles as the average of the similarity values among all their members, that
is,


$$sim(sa_i, sa_j) = \frac{\sum_{l=1}^{|sa_i|} \sum_{m=1}^{|sa_j|} sim(a_{il}, a_{jm})}{|sa_i| \cdot |sa_j|} \qquad (7)$$


Accordingly, $msup(k,i)$, which denotes the support from super-article $sa_k$ to
super-article $sa_i$, is represented as

$$msup(k,i) = \frac{sim(sa_k, sa_i)}{\sum_{sa_j \in SA} sim(sa_k, sa_j)} \qquad (8)$$


and we define $msup(i,i) = 0$. As in TWEM, the final hub scores are combined with
popularity to get the topical importance for websites. Note that the primary aim of
our models is to rank websites, though we devote considerable effort (e.g.,
consideration of article support and adoption of article merging) to modeling relations
among news articles.

Fig. 4. The merge-TWEM model
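
As an illustration of the super-article similarity and support, the sketch below averages the pairwise similarities of the members and then normalises each row. The function names are hypothetical, and, as an assumption, the self term is left out of the normalising sum so that the supports leaving each super-article add up to one.

```python
def super_sim(group_a, group_b, sim):
    """Average similarity over all member pairs of two super-articles."""
    total = sum(sim[l][m] for l in group_a for m in group_b)
    return total / (len(group_a) * len(group_b))

def msup_matrix(groups, sim):
    """msup[k][i]: support from super-article k to super-article i."""
    n = len(groups)
    ssim = [[super_sim(groups[k], groups[i], sim) for i in range(n)]
            for k in range(n)]
    msup = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Assumption: exclude the self term from the denominator.
        total = sum(ssim[k][j] for j in range(n) if j != k)
        if total > 0:
            for i in range(n):
                if i != k:
                    msup[k][i] = ssim[k][i] / total
    return msup

sim = [[1.0, 0.95, 0.2],
       [0.95, 1.0, 0.3],
       [0.2, 0.3, 1.0]]
groups = [[0, 1], [2]]          # super-articles as lists of article indices
between = super_sim(groups[0], groups[1], sim)
msup = msup_matrix(groups, sim)
```

With these supports in place, the iteration has the same shape as in TWEM, with super-articles taking the role of articles.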

4  Experiments 

In this section, we conduct experimental studies to evaluate the effectiveness of the
TWEM and merge-TWEM models. Before going into the details, we first describe the
dataset and evaluation method.

4.1  Dataset  and  Evaluation  Method 

There are no benchmark datasets for evaluating our proposed models. In our
experiments, we select ten test cases which have been hot news topics during the last
two years. For each test case, we submit a representative query to Google News and
download the first 400 articles, which form the article collection. The websites
extracted from the URLs of the articles form the website collection. More details
about this dataset can be found in Appendix 1.

Both TWEM and merge-TWEM are run on this dataset. For each topic, the two models
output websites ranked according to their topical importance. We evaluate the results
using a manual scoring strategy, which consists of the following four steps.

Step 1: The websites in the results are divided into 4 groups by assessors according
to their ranks and topical importance generated by our computation model. The
groups are marked as “Very Important”, “Important”, “Unimportant” and “Very
Unimportant”, respectively. Each website then gets an importance level accordingly.
This level can be viewed as the “classification” result of the models.

Step 2: From each group, we randomly select 10 websites. These 40 selected websites
are used as the evaluation samples.

Step 3: Three assessors browse the articles of each sample and independently assign it
a score (between 0 and 4). This sample is then given another importance level
according to the average of the three scores. This level is the ground-truth result for
this sample.

Step  4:  After  comparing  the  importance  levels  assigned  in  Step  1  and  Step  3,  the 
Evaluation  Sample  Precision  (ESP)  can  be  calculated  as  the  proportion  of  evaluation 
samples  whose  importance  levels  in  Step  1  and  Step  3  are  identical  to  the  total  num¬ 
ber  of  evaluation  samples. 

ESP  measures  whether  TWEM  or  merge-TWEM  can  rank  websites  correctly  on  news 
topics.  We  use  ESP  as  the  evaluation  metric  in  our  experiments,  and  the  overall  per¬ 
formance  is  evaluated  by  averaging  the  individual  ESP  values  over  the  10  topics. 
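
A small Python sketch of the ESP computation follows. The function names and the sample data are hypothetical, and the mapping from an average assessor score to an importance level is not specified in the text, so the levels are taken as given inputs.

```python
def esp(model_levels, truth_levels):
    """Proportion of evaluation samples whose model-assigned level (Step 1)
    equals the assessor-assigned level (Step 3)."""
    matches = sum(1 for site, level in model_levels.items()
                  if truth_levels[site] == level)
    return matches / len(model_levels)

def overall_esp(per_topic_esp):
    """Overall performance: the mean ESP over all topics."""
    return sum(per_topic_esp) / len(per_topic_esp)

# Hypothetical levels for four sampled websites on one topic.
model = {"s1": "Very Important", "s2": "Important",
         "s3": "Unimportant", "s4": "Very Unimportant"}
truth = {"s1": "Very Important", "s2": "Unimportant",
         "s3": "Unimportant", "s4": "Very Unimportant"}
topic_esp = esp(model, truth)
```

Because ESP counts only exact matches, a sample that is off by one level is penalised as much as one that is off by three.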

4.2  Performance  Evaluation 

The parameters are set as follows. The threshold in the merge-TWEM model is set to
0.9 since we want to exert strict restrictions on the merging operation. The controlling
factor $\alpha$ is set experimentally. We tune it from 0 to 1.0 with 0.1 as the step
size, and Figure 5 shows the variation in performance for TWEM and merge-TWEM.
We can see that both models perform best when $\alpha$ is equal to 0.3. Therefore, we
set $\alpha$ to 0.3.


Fig. 5. ESP of TWEM and merge-TWEM as $\alpha$ varies


We compare the results of TWEM and merge-TWEM with those of Google News and
Alexa. With Google News, we can only get ranks of news articles. For a test topic $t$,
we follow the procedure in Figure 6 to generate the ranks of websites from Google
News. This is actually a weighted voting strategy for ranking websites. Alexa ranks
have been widely used for website ranking and evaluation. For each test topic, the
websites in the dataset are ranked simply according to their Alexa ranks. The Google
News and Alexa results are also evaluated with the method introduced in Section 4.1.
Table 1 shows the experimental results of the various methods.

In the table, we can see that both TWEM and merge-TWEM achieve better performance
than the baselines. TWEM outperforms Google News and Alexa by 25.5% and 36.0%
respectively, while the improvements obtained by merge-TWEM are 27.6% and 38.2%.
In order to determine whether these improvements are statistically significant, we
perform several one-tailed t-tests, and Table 2 gives the P-values of TWEM and
merge-TWEM compared to Google News and Alexa. From this table, we find that both
TWEM and merge-TWEM perform significantly better than the baselines at a 95%
confidence level.

Step 1: A group of keywords which are representative of $t$ are submitted to Google.

Step 2: The first 400 web pages returned by Google are downloaded. Websites are
extracted from the URLs of these web pages.

Step 3: Each website is assigned a Google score which equals the weighted summation
of the Google ranks of its web pages, that is,

$$score(w) = \sum_{i} c_i \cdot rank(p_i)$$

where $p_i$ is the $i$-th web page of the website $w$, $rank(p_i)$ is the Google rank
of the page $p_i$, and $c_i$ is the coefficient for the rank of $p_i$.

Step 4: The websites are finally ranked according to their Google scores.

Fig. 6. Generation of Google ranks for websites
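
The relative improvements quoted in this subsection can be recomputed from the ESP values reported in Table 1. The sketch below is only a worked check; the function name is hypothetical, and because it starts from the rounded table values, the result for TWEM against Alexa comes out at about 35.9% rather than the reported 36.0%.

```python
# ESP values as reported in Table 1.
ESP = {
    "TWEM": 0.6062,
    "merge-TWEM": 0.6162,
    "Google News": 0.4830,
    "Alexa": 0.4460,
}

def relative_improvement(model, baseline):
    """Relative ESP gain of a model over a baseline, in percent."""
    return (ESP[model] - ESP[baseline]) / ESP[baseline] * 100.0

for model in ("TWEM", "merge-TWEM"):
    for baseline in ("Google News", "Alexa"):
        gain = relative_improvement(model, baseline)
        print(f"{model} vs {baseline}: {gain:.1f}%")
```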


Table 1. Performance comparison between the models and the baselines

Methods | ESP
TWEM | 0.6062
merge-TWEM | 0.6162
Google News | 0.4830
Alexa | 0.4460

Table 2. P-values of the t-tests: (a) P-values of TWEM and merge-TWEM compared to
Google News; (b) P-values of TWEM and merge-TWEM compared to Alexa.

(a)

Methods | P-values
TWEM | 0.0142
merge-TWEM | 0.0110

(b)

Methods | P-values
TWEM | 0.0092
merge-TWEM | 8.24e-4

Table 3. Performance comparison among TWEM, TWEM-S, TWEM-P, TWEM-S-P

Models | ESP
TWEM | 0.6062
TWEM-S | 0.4360
TWEM-P | 0.4132
TWEM-S-P | 0.3208


Moreover, we compare the merge-TWEM and TWEM models. We observe from Table 1
that merge-TWEM has a higher ESP than TWEM (0.6162 vs. 0.6062). Table 2 also shows
that merge-TWEM has smaller P-values than TWEM in all cases. This suggests that the
article merging operation is effective in boosting the performance of TWEM. The
merge-TWEM model, which considers the special features of news articles, can
generate better ranking results and thus serve as a more effective solution to the NWE
problem.

Finally, we provide a detailed view of the TWEM model. TWEM utilizes support
among news articles and popularity of websites. We investigate the impact of these
two factors. After excluding each of them from TWEM, we get two new models, i.e.,
TWEM-S and TWEM-P. When the two factors are both excluded, TWEM boils down
to the basic HITS method, which is named TWEM-S-P. These newly derived models
are run on the dataset and their results are evaluated in the same way as TWEM.
Table 3 shows their ESP values. In the table, we find that TWEM performs better than
all the other three models. Of the two factors, popularity of websites brings the larger
improvement (0.4132 vs. 0.6062). Consideration of support among news articles also
improves the performance greatly (0.4360 vs. 0.6062). Based on the above
comparison, we conclude that these two factors play important roles in the
performance of TWEM.

5  Conclusion  and  Future  Work 

In  this  paper,  we  study  extensively  the  problem  of  News  Website  Evaluation.  We  pro¬ 
pose  two  models,  i.e.,  TWEM  and  merge-TWEM,  to  solve  this  problem.  TWEM  ex¬ 
ploits  fully  the  relations  between  websites  and  news  articles  to  infer  the  importance 
scores  of  websites.  Also,  TWEM  utilizes  information  from  Alexa  to  represent  popular¬ 
ity  of  websites.  The  merge-TWEM  model  improves  TWEM  by  incorporating  the  article 
merging  operation.  The  experiments  show  that  both  TWEM  and  merge-TWEM  outper¬ 
form  the  baseline  systems  significantly.  In  addition,  the  merge-TWEM  model  achieves 
better  performance  than  TWEM,  and  is  a  more  effective  solution  to  the  NWE  problem. 

In  this  paper,  we  mainly  focus  on  the  importance  of  websites  on  news  topics.  In 
our  future  work,  we  will  consider  extending  the  proposed  models  to  other  types  of 
topics.  Furthermore,  our  evaluation  procedures  are  still  based  on  human  judgment. 
Therefore, we will also study more reasonable and objective evaluation methods.

Appendix 1: Details about the Dataset

Topic No. | Brief Description | Number of Websites | Number of Pages
1 | An employee jumped from the fourth floor of a famous company's office building and died. | 118 | 208
2 | Insiders manipulated the stock price of JSSH, a bioengineering company in China. | 42 | 54
3 | The battery of Huntkey exploded in a foreign test, which was later reported widely. | 39 | 64
4 | Google China was punished by the Chinese government for disseminating vulgar links and images. | 188 | 365
5 | A famous broadcaster in China was reported to be a spy. Finally, this proved to be a rumor. | 207 | 367
6 | A farmer claimed that he took photos of the South China Tiger, a species which had been thought to be extinct. | 132 | 361
7 | In China, a girl in a TV show was found to look extremely like a super star, and received public attention. | 78 | 100
8 | In Sep 2009, Lenovo again launched laptop computers specially designed for university students. | 122 | 187
9 | A new movie, Sophie's Revenge, was released in Sep 2009 and broke the box-office records in China. | 81 | 96
10 | Windows is about to be released in Oct 2009. There are already many comments and discussions about it. | 96 | 143


Abstract. Recently, mining frequent substructures from XML data has gained a
considerable amount of interest. Different methods have been proposed and
examined for mining frequent patterns from XML documents efficiently
and effectively. While many frequent XML patterns generated are useful and
interesting, it is common that a large portion of them is not considered
interesting or significant for the application at hand. In this paper, we present a
systematic approach to ascertain whether the discovered XML patterns are
significant and not just coincidental associations, and provide a precise
statistical approach to support this framework. The proposed strategy combines
data mining and statistical measurement techniques to discard the
non-significant patterns. In this paper we consider the “Prions” database that
describes the protein instances stored for Human Prions Protein. The proposed
unified framework is applied to this dataset to demonstrate its effectiveness in
assessing the interestingness of discovered XML patterns by statistical means.
When the dataset is used for classification/prediction purposes, the proposed
approach will discard non-significant XML patterns without the cost of a
reduction in the accuracy of the pattern set as a whole.

Keywords:  data  mining,  interesting  rules,  statistical  analysis,  semi-structured 
data. 


1  Introduction 

Data mining, or knowledge discovery from data (KDD), is known for its capabilities in
extracting knowledge that is comprehensible, valid on tests and new data with some
degree of certainty, potentially useful, actionable, and novel [1]. The fast growth
in the amount of electronic data such as Web pages and XML data offers a new
dimension in pattern recognition and rule discovery. These electronic data are
heterogeneous collections of ill-structured data that have no rigid structure, and are
often referred to as semi-structured data [2]. A well-known data mining technique,
namely association rule mining, is widely used for discovering interesting associations
and correlations between data elements in a diverse range of applications. While there
are great achievements in discovering association rules within well-structured
(relational) data, a number of works remain in preliminary stages for
semi-structured data [3]. Since the introduction of the association rule mining problem
by [4], substantial work has gone into various directions, including the development of
efficient algorithms for finding associations [5-7] and measuring the interestingness
of association rules in structured data [8-14]. As data captured in semi-structured
formats such as XML begins to permeate many applications, association rule mining
from semi-structured data has become a new and interesting research area [15]. The
general problem of association rule mining includes the extraction of all the frequent
itemsets from which association rules are formed. A rule is said to be interesting if it
meets certain minimum support and confidence criteria [3]. The same holds for
mining frequent substructures in semi-structured data, which comprises candidate
substructure enumeration and frequency counting.

Works such as [2, 16-18] focus on developing algorithms to enable efficient and
effective association rule mining from semi-structured data. While these frequent
substructure mining techniques may discover interesting associations from a given
dataset, the problem that remains is that they may only reflect aspects of the database
being observed. As such, the patterns may not reflect the “real” significant
associations between the underlying structures. This problem arises because some
association  rules  are  discovered  due  to  pure  coincidence  resulting  from  certain 
randomness  in  the  particular  dataset  being  analyzed.  Since  the  nature  of  data  mining 
techniques  is  data  driven,  the  patterns  generated  by  these  techniques  must  be  validated 
by  a  statistical  methodology  for  them  to  be  useful  in  practice  [19].  Statistics  has 
previously  addressed  the  issues  of  how  to  separate  out  the  random  effects  to 
determine  if  the  measured  association  (or  difference  in  other  areas)  is  significant  [20]. 
Thus  additional  measures  based  on  statistical  independence  and  correlation  analysis 
are  needed  to  ensure  that  the  results  have  a  sound  statistical  basis  and  are  not  purely 
random  coincidence. 

Therefore, the motivation behind our proposed method is to investigate how data
mining and statistical measurement techniques can be combined to arrive at a more
reliable and interesting set of rules. The focus of the work presented in this paper is to
evaluate the frequent substructures extracted from XML documents and verify their
significance using statistical analysis. In this paper we apply the IMB3 algorithm [21]
to the Prions database in order to extract the frequently occurring substructures, while
statistical analyses, namely Chi-Squared and Log-Linear analysis, are used to
ascertain the significance of the discovered substructures. In the next section, we
explain the problem of discovering and ascertaining association rules from
semi-structured data. In Section 3, we describe some related works in the area of
frequent substructure mining and finding of significant patterns. We show
experimental findings of significant substructures in the Prions dataset in Section 4.
Section 5 concludes the paper and explains our ongoing work in this field of study.


2  Problem  Definition 

This section starts by describing some necessary aspects of association rule mining in
the context of XML document mining, which will lay the groundwork for defining the
problem of ascertaining patterns/association rules from semi-structured data. An XML
document has a hierarchical document structure, where an XML element may contain
further embedded elements, and these can have a number of attributes attached.
Elements that form sibling relationships may have an ordering imposed on them. Each
element of an XML document has a name and a value. Given such parallelisms, an
XML document can therefore be modeled as a rooted labeled ordered tree, where a
node in the tree corresponds to an XML element [15, 17]. If only structure is to be
considered, then a node in the tree will only correspond to an element name. However,
in the current study we are interested in attribute names and attribute values from
a particular domain, and hence a node will correspond to an element name and value.

A tree can be denoted as $T(v_0, V, L, E)$, where:

(1) $v_0 \in V$ is the root vertex;

(2) $V$ is the set of vertices or nodes;

(3) $L$ is the set of labels of vertices; for any vertex $v \in V$, $L(v)$ denotes the
label of $v$; and

(4) $E = \{(x, y) \mid x, y \in V\}$ is the set of edges in the tree.
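
To make the tree model concrete, the following Python sketch converts an XML fragment into a rooted labeled ordered tree using the standard library. The `Node` class, the function names, the `name=value` label format, and the sample XML are all hypothetical illustrations and are not taken from the Prions database.

```python
import xml.etree.ElementTree as ET

class Node:
    """A vertex with a label L(v) and an ordered list of children."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children if children is not None else []

def xml_to_tree(element):
    """Label each node with the element name and, if present, its value."""
    text = (element.text or "").strip()
    label = f"{element.tag}={text}" if text else element.tag
    # Iterating an Element yields its children in document order.
    return Node(label, [xml_to_tree(child) for child in element])

def vertices_and_edges(root):
    """Collect the vertex set V and the edge set E of the tree."""
    vertices, edges, stack = [root], [], [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            vertices.append(child)
            edges.append((parent, child))
            stack.append(child)
    return vertices, edges

# A made-up record; real data would come from the XML database being mined.
doc = ET.fromstring("<protein><name>PrP</name><length>253</length></protein>")
root = xml_to_tree(doc)
V, E = vertices_and_edges(root)
```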

The main problem in association mining from semi-structured documents such as
XML is that of frequent pattern discovery, where a pattern corresponds to a subtree
in this case, and a transaction to a fragment of the database tree whereby an
independent instance is described. This problem is more complex than traditional
frequent pattern mining from relational data because structural relationships need to
be taken into account. It is known as the frequent subtree mining problem, and can
be generally stated as: given a tree database $T$ and a minimum support threshold
($\sigma$), find all subtrees that occur at least $\sigma$ times in $T$.

Furthermore, depending on the domain of interest and the task that is to be
accomplished in a particular application, different types of subtrees can be mined
using different support definitions. For an overview of existing subtree types and
support definitions and their usage implications for general knowledge analysis tasks,
please refer to [22]. Many frequent subtree mining algorithms have been developed to
date, and for an extensive overview of the current state of the art in the field,
including comparisons of different approaches highlighting their
advantages/disadvantages, we refer the interested reader to [3, 15].

Due to the nature of the domain considered and the data used in this paper, we
focus on ordered induced subtrees, and the transaction-based support definition is
used. These can be formally defined as follows:

Definition 1. Given a tree $S = (v_{0S}, V_S, L_S, E_S)$ and a tree
$T = (v_{0T}, V_T, L_T, E_T)$, $S$ is an ordered induced subtree of $T$ iff
(1) $V_S \subseteq V_T$; (2) $L_S \subseteq L_T$ and $L_S(v) = L_T(v)$;
(3) $E_S \subseteq E_T$; and (4) the left-to-right ordering of sibling nodes in the
original tree is preserved.

When using the transaction-based support (TS) definition, the transactional support
($\sigma$) of a subtree $t$, denoted as $\sigma_{tr}(t)$, in a tree database $T_{db}$ is
equal to the number of transactions in $T_{db}$ that contain at least one occurrence of
subtree $t$.

Definition 2. Let the notation $t \prec k$ denote the support of subtree $t$ by
transaction $k$; then for TS, $t \prec k = 1$ whenever $k$ contains at least one
occurrence of $t$, and 0 otherwise. Suppose that there are $N$ transactions $k_1$ to
$k_N$ of tree in $T_{db}$; then $\sigma_{tr}(t)$ in $T_{db}$ is defined as:

$$\sigma_{tr}(t) = \sum_{i=1}^{N} t \prec k_i \qquad (1)$$
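
The two definitions can be illustrated with a naive Python sketch. It is not the IMB3 algorithm and makes no claim to efficiency: trees are represented, as an assumption, by nested `(label, children)` tuples, and the function names are hypothetical. A pattern matches at a node when the labels agree and the pattern's children can be matched, in left-to-right order, to direct children of that node.

```python
def matches_at(pattern, node):
    """True if pattern occurs as an ordered induced subtree rooted at node."""
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    pos = 0
    for p_child in p_children:
        # Scan left to right for the next direct child that matches;
        # taking the earliest match keeps the sibling order intact.
        while pos < len(n_children) and not matches_at(p_child, n_children[pos]):
            pos += 1
        if pos == len(n_children):
            return False
        pos += 1
    return True

def occurs_in(pattern, tree):
    """True if the pattern occurs at any node of the transaction tree."""
    return matches_at(pattern, tree) or any(
        occurs_in(pattern, child) for child in tree[1])

def transaction_support(pattern, transactions):
    """Number of transactions containing at least one occurrence."""
    return sum(1 for k in transactions if occurs_in(pattern, k))

# Three toy transactions over labels a, b, c, x.
k1 = ("a", [("b", []), ("c", [])])
k2 = ("a", [("c", []), ("b", [])])      # same labels, different sibling order
k3 = ("a", [("x", [("b", [])])])        # b is a grandchild, not a child
support_bc = transaction_support(("a", [("b", []), ("c", [])]), [k1, k2, k3])
support_b = transaction_support(("a", [("b", [])]), [k1, k2, k3])
```

Transaction `k2` does not count towards `support_bc` because the sibling order differs, and `k3` does not count towards `support_b` because an induced subtree must preserve direct parent-child edges.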

Hence,  in  our  current  work  we  focus  on  ascertaining  the  interestingness  of  discovered 
ordered  induced  subtree  patterns  that  have  been  extracted  from  a  tree-structured 
database  (XML),  and  that  satisfy  the  minimum  transaction-based  support  threshold. 

Let  us  denote  the  set  of  these  frequent  subtree  patterns  as  SF.  Please  note  that  the 
patterns  from  SF  have  not  been  assigned  a  particular  class  label  to  be  used  for  a 
prediction/classification task, and as such simply reflect the frequently occurring
associations,  that  may  not  necessarily  have  a  sound  statistical  basis.  Hence,  in  the  first 
problem  setting  our  aim  is  to  reduce  the  SF,  by  filtering  out  the  patterns  that  are  not 
statistically  significant  with  respect  to  the  statistical  measures  used. 

In  the  second  problem  setting,  one  of  the  attributes  from  the  data  is  considered  as  a 
class  to  be  predicted  for  classification  task  purposes.  Hence,  we  only  consider  those 
patterns  from  SF,  that  contain  this  class  attribute,  as  they  will  represent  the  set  of 
values  that  frequently  occur  together  when  a  particular  class  value  is  present.  Hence, 
as  such  these  patterns  can  be  seen  to  have  predictive  power  and  can  be  evaluated  for 
their  accuracy  on  correctly  predicting  the  class  value  from  the  trained  data  and  unseen 
data. In addition to predictive accuracy, simple rules are preferred as they are easier to
comprehend  and  are  expected  to  perform  better  on  unseen  data  since  they  are  more 
general.  Hence,  when  in  the  process  of  optimizing  a  rule  set,  a  trade-off  needs  to  be 
made  between  several  factors  and  the  common  ones  are: 

- Misclassification rate (MR) - number of incorrectly classified instances

-  Coverage  rate  (CR)  -  number  of  captured  instances 

-  Generalization  power  (GP)  -  capability  of  correctly  classifying  future  instances 

When  optimizing  the  rule  set,  the  MR  should  be  minimized  while  the  CR  should  be 
maximized.  GP  is  achieved  by  simplifying  the  rules  in  terms  of  overall  rule  set  size 
and  the  number  of  attribute  constraints  in  the  rule.  The  trade-off  occurs  especially 
when  the  data  set  is  characterized  by  continuous  attributes  where  a  valid  attribute 
range  constraint  needs  to  be  determined  for  a  particular  rule.  Increasing  the  range 
constraint  usually  leads  to  the  increase  in  CR  of  that  rule  but  at  the  cost  of  an  increase 
in  MR  of  that  rule.  Similarly,  if  the  rules  are  too  general,  they  may  lack  the  specificity 
to  distinguish  some  domain  characteristics  and  hence  the  MR  would  increase. 
Generally speaking, an optimized rule set should be more accurate than the
original rule set, and/or it should strike a much better balance between the
trade-off factors. For example, if there are many rules with small CR but very low
MR, a rule set with a significantly smaller number of rules may be preferred even
at the cost of an increase in MR.

Since the number of patterns/association rules generated through frequent subtree
mining can be quite large, their usefulness for a classification/prediction task may
be limited unless they are significantly reduced in size and number. While their MR
may be small, their GP is likely to be poor, because all frequent patterns are
considered, including those that are insignificant, redundant and unnecessarily
complex. Hence, in the second problem considered in this paper, we aim to apply a
variety of statistical/heuristic methods to reduce the pattern/rule set size and
simplify individual rules.

Let us denote the subtree patterns from the frequent subtree set SF that have a class
label (value) as SFC. The problem considered in the second setting can be stated as
follows: given SFC with accuracy ac, reduce SFC into SFC' such that SFC' has
accuracy > (ac - ε), where ε is an arbitrary user-defined small value (ε is used to
reflect the noise that is often present in real-world data).


3  Related  Works 

Our work in this paper focuses on ascertaining the XML rules discovered from an
XML-enabled association rule framework. This framework was initiated in [17] and
resulted in more flexible and powerful representations of both simple and complex
structured association relationships inherent in XML documents. There has been
active development of frequent subtree mining algorithms [16, 18, 21, 23-25]. For a
more detailed description of the existing approaches and the latest developments on
these algorithms, please refer to [3, 15]. Currently there is limited work on the rule
evaluation phase for semi-structured rules. Many of the well-developed rule
interestingness measures apply to structured data, and they have had great success in
evaluating rule interestingness, as discussed in [12]. Initial work on evaluating
discovered patterns based on statistical significance is presented in [13, 26-28],
but it is limited to structured data. The existence of a vast number of well-developed
techniques for measuring the interestingness of rules from relational data offers
great opportunities for adapting these techniques to verify significant substructures
in semi-structured data. The applicability of these interestingness measures needs to
be explored in the context of frequent substructure mining, where adjustments and
extensions are necessary to ascertain the validity of the methods in the presence of
more complex structural aspects in the data, which often need to be preserved in the
rules.

One line of work that focuses on more interesting substructure patterns reduces
the number of patterns by applying plausible constraint techniques. The problem of
mining mutually dependent ordered subtrees has been addressed in [29]. The proposed
algorithm utilizes the hyperclique method [30] in the tree mining context so that all
the components of a subtree are highly correlated. These hyperclique subtree
patterns are discovered using the h-confidence measure, which is the minimum
probability that the presence of an item from a pattern in a transaction implies the
presence of all other items of the pattern in the same transaction. Hence, the
extracted hyperclique subtree patterns satisfy the minimum h-confidence threshold.
The work in [31] uses the method proposed in [32] for database compression in item
set mining to demonstrate how the same minimum description length principle can yield
good results for sequential and tree-structured data. Another notable work, presented
in [33], extends the idea of the item constraint [34] to that of a node-inclusion
constraint in subtrees. In addition, [35] proposed the application of monotone
constraints, namely anti-monotone, monotone, convertible and succinct, in frequent
subtree mining. Such an opportunistic pruning strategy is used to mine frequent
subtrees under the defined constraints. An approach for mining frequent subtrees
where the distance between the nodes is used as an additional grouping criterion has
been presented in [36]. [37] proposed and demonstrated an efficient way to discover
interesting association rules from dynamic XML documents. The work in [37] is mainly
motivated by the fact that an XML document's content and/or structure is always
changing.
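To make the h-confidence measure mentioned above concrete, the following is a minimal sketch for flat item sets rather than subtrees. It uses the common formulation in which the h-confidence of a pattern is its support divided by the largest support of any single item in it; the function names and the toy transactions are illustrative assumptions and are not taken from [29] or [30].

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def h_confidence(pattern, transactions):
    """Minimum, over items i in the pattern, of P(whole pattern | i).

    This equals support(pattern) / max_i support({i}).
    """
    max_item_support = max(support([i], transactions) for i in pattern)
    if max_item_support == 0:
        return 0.0  # no item of the pattern ever occurs
    return support(pattern, transactions) / max_item_support


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b"},
]
hc = h_confidence({"a", "b"}, transactions)  # 0.5 / 0.75
```

A pattern would be kept as a hyperclique pattern when this value reaches a user-chosen minimum h-confidence threshold.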

Besides the aforementioned constraint-based techniques, to our knowledge there is
limited work on verifying the significance of discovered frequent substructures.
The frequently occurring substructures discovered by a frequent substructure mining
algorithm commonly form a complete pattern set that is too numerous to be utilized
efficiently and effectively for the application at hand [9, 38]. [39] proposed and
developed an application of statistical hypothesis testing to re-rank the significant
frequent subtrees. This approach ranks the significant patterns according to P-values
obtained from Fisher's Exact test of significance. The significant patterns were
then used for Glycan classification problems. Recently, [38] proposed a mining
framework called LEAP (Descending Leap Mine) for finding significant frequent
subgraphs, which helps in discarding redundant frequent subgraphs. For a predefined
class label in XML documents, an efficient XRules classifier has been developed by
[40]. This approach offers promising results as a structural classifier for
semi-structured data.

In this work we employed the IMB3 miner algorithm for mining ordered embedded
subtrees. While such algorithms offer some constraints for discovering strong
patterns/rules, many misleading, uninteresting and insignificant rules in the domain
may still be produced [1]. The problem arises because some association rules are
discovered due to pure coincidence resulting from certain randomness in the
particular dataset being analyzed. Statistics has previously addressed the issue of
how to separate out the random effects to determine whether the measured association
(or difference in other areas) is significant [20]. Thus, additional measures based on
statistical independence and correlation analysis are needed to ensure that the
results have a sound statistical basis and are not purely random coincidence.

A common multivariate statistical analysis is the association analysis problem [20].
For associations between categorical variables there are several inferential methods
involved. Chi-Squared analysis is often used to measure the difference between
observed and expected frequencies. The Chi-Squared statistic is used for hypothesis
testing in tests of independence. In addition, Log-Linear analysis offers a unique
feature in capturing the interrelationships among data items [41].
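As an illustration of the Chi-Squared test of independence described above, the sketch below computes the Pearson statistic from observed and expected frequencies for a contingency table of two categorical variables. The table values are hypothetical, the function name is our own, and the decision rule assumes a 2 x 2 table (1 degree of freedom) at the 0.05 level, for which the critical value is 3.841.

```python
def chi_squared(table):
    """Pearson Chi-Squared statistic for a contingency table (list of rows).

    Assumes every row and column total is non-zero, so that all expected
    frequencies are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under the independence hypothesis.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Critical value of the Chi-Squared distribution, 1 d.o.f., alpha = 0.05.
CRITICAL_1DF_05 = 3.841

# Hypothetical counts: rows = value A present/absent, columns = value B present/absent.
table = [[30, 10], [10, 30]]
stat = chi_squared(table)
significant = stat > CRITICAL_1DF_05
```

Here every expected frequency is 20, so each cell contributes (10^2)/20 = 5 and the statistic is 20, well above the critical value; the association would be reported as significant.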

4  Experimental  Results 

The evaluation of the unification framework is performed using the Prions database.
A prion is a type of infectious agent: prions are abnormally structured forms of a
host protein, which are able to convert normal molecules of the protein into the
abnormally structured form. The Prions dataset describes the Protein Ontology database
for Human Prions proteins in XML format [42]. It consists of 17348 protein sequences.
The XML tags and values are first mapped to integer indexes, similar to the format
used in [21] and [25]. Representing a label as an integer instead of a string has
considerable performance and space advantages [21]. In this section, we first show
the patterns generated by the frequent subtree mining approach, namely the IMB3
algorithm, in Section 4.1. Then, in Section 4.2, we apply two prominent statistical
measurement techniques, namely Chi-Squared analysis and Log-Linear analysis, to
measure the significance of the discovered frequent patterns. In Section 4.3, we
consider the Prions protein database in a classification/prediction problem setting.
We have labeled the protein instances as referring to either Human or Animal protein.
We then verified the extracted patterns using the statistical analysis.

4.1  Extracted  Frequent  Patterns 

The discovery of structural patterns by matching data representation structures is
essential for the analysis and understanding of data. If a structural pattern occurs
frequently, it ought to be important in some way. On the other hand, infrequent
patterns may also provide meaningful information [42]. Thus, to extract meaningful
information from XML data we need to mine structural patterns. To discover the
frequent patterns in the Prions dataset we apply the IMB3 algorithm. A total of 27
patterns were discovered by the IMB3 algorithm. The minimum support value used was
10%, and the largest subtree patterns discovered consist of 5 nodes. Table 1 shows
several examples of the patterns discovered.

Table 1. Examples of Several Patterns Discovered Based on Frequent Tree Mining Technique

Pattern #   Pattern                                                   # of Occurrences
1           ATOMChain(A), Element(C)                                  3957
2           ATOMChain(A), ATOMResidual(TYR), Occupancy(1)             1743
3           ATOMChain(A), Occupancy(1), Temperature(0), Element(C)    3805

Pattern number 1 shows an association between ATOMChain(A) and Element(C),
and this pattern was discovered 3957 times. Here ATOMChain with value A is
associated with Element with value C. The patterns discovered by the IMB3 algorithm
can aid in discovering potentially useful pattern structures in Protein Ontology
datasets, which makes them useful for comparing protein datasets taken across protein
families and species and helps in discovering interesting similarities and
differences. However, the question remains whether these patterns are discovered due
to pure coincidence resulting from certain randomness in the particular dataset being
analyzed. Furthermore, they are often quite large in number, which can degrade the
analysis procedure. Hence, in the next section we measure the statistical
significance of the discovered patterns in order to remove any non-significant
patterns.

4.2  Frequent  Patterns  Significant  Test 

Statistical analysis approaches, namely Chi-Squared and Log-Linear analysis, were
employed in order to determine the usefulness of the frequent rules obtained. The
results from Chi-Squared analysis are discussed first.

Table 2. Patterns Verification Based on Chi-Squared Analysis

Node Name                                     Sig. Att. Value
ATOMResidual(TYR)      Occupancy(1)           Not Sig.
Occupancy(1)           Temperature(0)         Not Sig.
ATOMChain(A)           Occupancy(1)           Not Sig.
ATOMChain(A)           Element(C)             Sig.
ATOMChain(A)           Element(H)             Sig.
Temperature(0)         Element(C)             Sig.
Temperature(0)         Element(H)             Sig.
ATOMChain(A)           ATOMResidual(TYR)      Sig.
ProteinOntologyID(3)   Occupancy(1)           Not Sig.
ProteinOntologyID(3)   Element(C)             Sig.
ATOMChain(A)           Temperature(0)         Sig.
Occupancy(1)           Temperature(1)         Not Sig.
Occupancy(1)           Element(N)             Not Sig.
Occupancy(1)           Element(C)             Not Sig.
Occupancy(1)           Element(O)             Not Sig.
Occupancy(1)           Element(H)             Not Sig.

Table 2 shows that there are 16 association relationships among structure-value
items discovered using the IMB3 algorithm. Based on Chi-Squared analysis, 7 out of
16 relationships are significant. Table 3 shows 11 patterns with more than two nodes.
We apply the Log-Linear analysis to examine the association between these nodes.
Only one of the 11 patterns is accepted as significant based on this analysis: we
can conclude that there is a significant association between ATOMChain(A),
Temperature(0) and Element(H).


Table 3. Patterns Verification Based on Log-Linear Analysis

Node Name                                                                Sig. Att. Value
ATOMChain(A)     ATOMResidue(TYR)   Occupancy(1)                         Not Sig.
ATOMChain(A)     Occupancy(1)       Temperature(0)                       Not Sig.
ATOMChain(A)     Occupancy(1)       Element(C)                           Not Sig.
ProteinOnto(3)   Occupancy(1)       Element(C)                           Not Sig.
ATOMChain(A)     Occupancy(1)       Element(H)                           Not Sig.
Occupancy(1)     Temperature(0)     Element(C)                           Not Sig.
ATOMChain(A)     Temperature(0)     Element(C)                           Not Sig.
Occupancy(1)     Temperature(0)     Element(H)                           Not Sig.
ATOMChain(A)     Temperature(0)     Element(H)                           Sig.
ATOMChain(A)     Occupancy(1)       Temperature(0)     Element(C)        Not Sig.
ATOMChain(A)     Occupancy(1)       Temperature(0)     Element(H)        Not Sig.

4.3  Prions  as  a  Classification  Problem 

As in our previous work [11], the unification framework involves several steps in
ascertaining the rules discovered from the association rule mining process. For the
Prions dataset, similar steps were followed. We defined a new variable (the target
variable) identifying the Human Protein or Animal Protein class. This new variable
was derived from the ProteinOntologyID and SuperFamily variables. Hence, we have
excluded the ProteinOntologyID and SuperFamily variables from the dataset considered
in this task. Thus, in this classification problem we have chosen the target variable
(i.e. Human or Animal protein) as the right-hand side/consequence of the association
rules.

In this experiment, we divided the Prions dataset into a training set (60%) and a
testing set (40%). Then we applied preprocessing techniques, including the removal of
missing values and the discretization of attributes with continuous data. The equal
depth binning method was selected, as this approach offered a better result, as
discussed in [11]. The relevance of the attributes for predicting the target
attribute is shown in Table 4, based on the Symmetrical Tau [43] and Mutual
Information [12] techniques. As discussed in [11], the Symmetrical Tau (ST) approach
offers a better criterion for discriminating the class to be predicted than Mutual
Information (MI), as it does not favor multi-valued attributes. Attributes whose ST
values are relatively lower than the other attributes' ST values are considered
irrelevant for the task. A significant difference was considered to occur at the
position in the ranking where an attribute's ST value is less than half of the
previous attribute's ST value. Hence, for this dataset, the attributes 'Occupancy'
and 'Y' were considered irrelevant for the prediction task and were removed.
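The cut-off rule described above can be sketched as follows, using the ST values reported in Table 4. The function name is our own, and the code is only an illustration of the stated rule (stop at the first attribute whose ST value is less than half of its predecessor's), not the implementation used in the experiments.

```python
def relevant_attributes(ranked):
    """Keep attributes up to the first sharp drop in a descending ranking.

    `ranked` is a list of (name, value) pairs sorted by descending value.
    The cut is placed at the first attribute whose value is less than half
    of the previous attribute's value; that attribute and all later ones
    are treated as irrelevant.
    """
    if not ranked:
        return []
    kept = [ranked[0][0]]
    for (_, previous), (name, value) in zip(ranked, ranked[1:]):
        if value < previous / 2:
            break
        kept.append(name)
    return kept


# ST values as listed in Table 4.
st_ranking = [
    ("ATOMChain", 0.2088), ("Temperature", 0.1230), ("Z", 0.0812),
    ("ATOMid", 0.0407), ("ATOMResSeqNum", 0.0280), ("X", 0.0256),
    ("Element", 0.0153), ("Atom", 0.0109), ("ATOMResidue", 0.0082),
    ("Y", 0.0029), ("Occupancy", 0.0001),
]
kept = relevant_attributes(st_ranking)
removed = [name for name, _ in st_ranking if name not in kept]
```

With these values the first drop below half occurs at Y (0.0029 < 0.0082 / 2), so Y and Occupancy are the attributes removed.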


Table 4. Comparison between ST and MI for Prions Dataset

Variables        ST Values     Variables        MI Values
ATOMChain        0.2088        ATOMChain        0.2605
Temperature      0.1230        Z                0.1610
Z                0.0812        Temperature      0.1526
ATOMid           0.0407        ATOMResSeqNum    0.1053
ATOMResSeqNum    0.0280        ATOMid           0.0721
X                0.0256        X                0.0549
Element          0.0153        Atom             0.0238
Atom             0.0109        ATOMResidue      0.0187
ATOMResidue      0.0082        Element          0.0162
Y                0.0029        Y                0.0048
Occupancy        0.0001        Occupancy        0.0000

Table 5. Examples of Prions Rules

Set Size   Confidence   Support   Count   Rules
2          75.32        8.97      934     X(g) ==> Class(Animal)
4          61.71        6.66      693     X(d) & Z(b) & ATOMChain(A) ==> Class(Human)

Next, the rules are generated based on the minimum support and confidence
framework, with thresholds of 5% and 60% respectively. Table 5 shows examples of the
generated rules. The discovered rules are then ascertained with statistical
techniques, namely Chi-Squared [20] and Logistic Regression [20]. Based on these
statistical analyses we found that only the variables ATOMChain, ATOMResidual,
ATOMResSeqNum, X and Z were significant contributors towards the target variable of
class Human or Animal. Additional constraint measurement techniques were applied in
order to discard redundant rules [11, 13]. The combination of these rule ascertaining
strategies helps the association rule mining framework to determine the right,
high-quality rules. These rules will have a sound statistical basis, and we can be
more confident that they reflect the real-world situation.
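For illustration, the support and confidence of a rule such as those in Table 5 can be computed as in the sketch below. The record layout (one dictionary of attribute values per instance), the function names and the toy records are assumptions made for this example; only the 5% and 60% thresholds come from the text.

```python
def rule_stats(antecedent, consequent, records):
    """Return (support %, confidence %, count) of `antecedent ==> consequent`.

    `antecedent` and `consequent` map attribute names to required values.
    Support is the share of all records matching both sides; confidence is
    the share of antecedent-matching records that also match the consequent.
    """
    lhs = [r for r in records
           if all(r.get(k) == v for k, v in antecedent.items())]
    both = [r for r in lhs
            if all(r.get(k) == v for k, v in consequent.items())]
    support = 100.0 * len(both) / len(records)
    confidence = 100.0 * len(both) / len(lhs) if lhs else 0.0
    return support, confidence, len(both)


def passes(support, confidence, min_support=5.0, min_confidence=60.0):
    """Minimum support/confidence filter with the thresholds used in the text."""
    return support >= min_support and confidence >= min_confidence


# Hypothetical toy records, not taken from the Prions dataset.
records = (
    [{"X": "g", "Class": "Animal"}] * 3
    + [{"X": "g", "Class": "Human"}]
    + [{"X": "d", "Class": "Human"}] * 6
)
support, confidence, count = rule_stats({"X": "g"}, {"Class": "Animal"}, records)
keep_rule = passes(support, confidence)
```

On the toy data the rule X(g) ==> Class(Animal) covers 3 of 10 records (support 30%) and holds for 3 of the 4 records with X = g (confidence 75%), so it passes both thresholds.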

In Table 6 we show the progressive difference in the number of rules generated as
statistical analysis and redundancy checks are applied. We also show the respective
classification accuracy (% of correctly classified instances from the training set)
and predictive accuracy (% of correctly classified instances from the testing set) of
those rule sets. After removing 73% of the rules, we found that both classification
and predictive accuracies increased by more than 5%. This demonstrates the
importance of ascertaining the association rules by statistical analysis and
redundancy checks, as in this particular scenario the simplified rule set is more
general and performs better on unseen data.

The combination of statistical significance analysis and redundancy analysis
provided a proper way of discarding non-significant rules, which significantly
reduces the overall complexity of the rule set. From Table 6 we can also see that
this large reduction in the number of rules did not come at the cost of a reduction
in accuracy; in fact, accuracy increased for the Prions dataset in both classifying
and predicting the protein classes.

Table 6. Rules Accuracy for Prions Data

Dataset Description*    Rule #   Type of Analysis                          Accuracy
                                                                           Classification   Prediction
Train: 10407 records    42       Initial Rules                             74.36%           75.00%
Test: 6938 records      11       Statistical Analysis / Redundancy Check   79.97%           80.37%

*  Two  records  with  missing  values  were  discarded. 


5  Conclusions  and  Future  Works 

This  was  our  preliminary  work  towards  the  combination  of  data  mining  and  statistical 
techniques  in  ascertaining  the  rules/patterns  from  semi-structured  data.  The  combination 
of  the  approaches  used  in  this  method  demonstrated  a  number  of  ways  for  ascertaining 
the  significant  patterns  obtained  using  frequent  subtree  mining  approaches.  In  this  paper 
we  employed  statistical  analysis  that  provides  some  control  in  lowering  the  risk  of 
discovering  a  pattern  that  is  false  and  spurious.  In  our  future  work  we  aim  to  test  the 
approach using tree-structured data of various characteristics and complexities.


Abstract. This paper proposes an approach toward improving re-coloring based
clustering with graph b-coloring. The previous b-coloring based clustering algorithm
did not consider the quality of clusters. Although a greedy re-coloring algorithm
was proposed, it was still restrictive in terms of the explored search space
due to its greedy and sequential re-coloring process. We aim at overcoming these
limitations by enlarging the search space for re-coloring, while guaranteeing the
b-coloring properties. A best first re-coloring algorithm is proposed to realize
non-greedy search for the admissible colors of vertices. A color exchange algorithm is
proposed to remedy the problem in sequential re-coloring. These algorithms are
orthogonal with respect to the re-colored vertices and thus can be utilized in
conjunction. Preliminary evaluations are conducted over several benchmark datasets,
and the results are encouraging.


1  Introduction 

When the dissimilarities among data items are specified, the entire set of data
items can be represented as a graph structure, where each data item is mapped to a
vertex and the vertices are connected by edges with the corresponding dissimilarities.
Several graph-based clustering methods have been proposed [7,9,16]. Recently, [12]
proposed the notion of b-coloring of undirected graphs. A graph b-coloring is a
vertex coloring that satisfies the following constraints: (i) adjacent vertices have
different colors, (ii) in each color, at least one vertex is adjacent to all the other
colors. Based on this, [5] proposed a clustering method, but it did not consider the
quality of clusters. Although a re-coloring algorithm was proposed to reflect the
quality of clusters [6], it was still restrictive in terms of the explored search
space due to its greedy and sequential process.

This paper proposes an approach toward improving re-coloring based clustering with
graph b-coloring. The vertices in a graph are divided into two disjoint subsets based
on the property of b-coloring. A best first re-coloring algorithm is proposed to
realize non-greedy search for the admissible colors of vertices in one subset. The
constraint (i) can make it impossible to re-color vertices in a sequential approach.
A color exchange algorithm is proposed so that this problem can be resolved. Both
algorithms enlarge the search space for re-coloring, and re-color the vertices to
improve the quality of clusters. Since these algorithms are orthogonal with respect
to the re-colored vertices, they can be utilized in conjunction. Preliminary
evaluations are conducted over several UCI datasets. The results are encouraging for
pursuing this line of research, especially with respect to the ground truth
micro-averaged precision.

1.1 Related Work

In general, clustering methods are divided into hierarchical methods and partitioning
methods [13]. Hierarchical methods construct a cluster hierarchy, or a tree of
clusters (called a dendrogram), where leaves correspond to data items and internal
nodes to nested clusters of various sizes [8]. On the other hand, partitioning methods
return a single partition of the entire data under fixed parameters (number of
clusters, thresholds, etc.). Each cluster can be represented by its centroid [10] or
by one of its objects located around its center [15].

Until now various clustering methods have been proposed based on graph-theoretic
concepts. In one approach, a partition of data items is obtained by removing edges
and dividing the graph into several disconnected components. For instance, in spectral
clustering, the removal of edges is conceived in terms of the minimum cut of the
graph, and eigenvectors of the (normalized) graph Laplacian are utilized [16]. Other
approaches utilize graph coloring techniques [4].

As for the coloring based approach, a hierarchical agglomerative clustering method
was proposed in [7]. It conducts a 2-coloring of vertices in order to find a maximum
spanning tree of a graph. In [9], partitioning of data items into clusters is
conceived in terms of the minimal coloring of a graph. Our approach is yet another
graph based partitioning method based on vertex coloring of a graph.

Section 2 gives an overview of b-coloring based clustering and points out some
issues. The details of our proposal are presented in Section 3. Preliminary
evaluations are reported in Section 4 and the results are discussed. Section 5 gives
concluding remarks and indicates future directions.


2  b-Coloring  Based  Clustering 

2.1  Preliminaries 

We use a bold capital letter to denote a set of objects. For a set V, |V| represents
its cardinality. A graph G(V, E) consists of a set of vertices V and a set of edges E
over V x V. We assume that G(V, E) is an undirected, simple graph without self-loops.
The symbol Δ denotes the maximum degree in a graph [4].

Suppose data items are clustered or grouped into a partition P = {C_1, C_2, ..., C_k},
where C_i stands for a group (cluster) of data items. Since each cluster is
represented as a color in our approach, we abuse the symbol P to represent both the
set of clusters and the set of colors in a graph.


Table 1. Notations

symbol         description
n              the number of vertices
m              the number of edges
Δ              maximum degree of a graph
P              the set of colors in a graph
c(v_i)         the color of vertex v_i
N(v_i)         neighboring vertices of vertex v_i
N_c(v_i)       neighboring colors of vertex v_i
C_p(v_i)       admissible colors for vertex v_i
d(v_i, v_j)    dissimilarity between v_i and v_j
d_a(v, C_i)    average dissimilarity between v and C_i


H.  Ogino  and  T.  Yoshida 


For a graph G(V, E), we define several functions over the vertices V in G. A
function N(v) returns the set of vertices adjacent to the vertex v. A function c(v)
returns the color of v in G, and a function N_c(v) returns the set of neighboring
colors of v. A function C_p(v) returns the set of admissible colors for v, i.e., the
colors which are different from those in N_c(v). Note that C_p(v) contains the
original color c(v) of v.
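The functions defined above can be sketched directly on an adjacency-set representation of the graph. The representation (a dictionary mapping each vertex to the set of its neighbors), the argument order and the small example graph are assumptions made for illustration only.

```python
def neighbors(graph, v):
    """N(v): the set of vertices adjacent to v."""
    return graph[v]


def neighbor_colors(graph, coloring, v):
    """N_c(v): the set of colors used by the neighbors of v."""
    return {coloring[u] for u in graph[v]}


def admissible_colors(graph, coloring, colors, v):
    """C_p(v): the colors of P that do not appear among the neighbors of v."""
    return set(colors) - neighbor_colors(graph, coloring, v)


# Hypothetical example: a triangle a-b-c with a pendant vertex d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
coloring = {"a": 1, "b": 2, "c": 3, "d": 1}  # c(v) for a proper coloring
colors = {1, 2, 3}                           # the set P

cp_d = admissible_colors(graph, coloring, colors, "d")  # {1, 2}
cp_a = admissible_colors(graph, coloring, colors, "a")  # {1}
```

In a proper coloring the original color c(v) never appears among the neighbors of v, so it is always contained in C_p(v), as the example shows for both d and a.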

It is assumed that a dissimilarity function d: V x V -> R+ is specified for data
items V. For instance, d(v_i, v_j) returns the dissimilarity between the pair of
vertices v_i and v_j. For every v ∈ V and every C_i ∈ P, an average dissimilarity
between v and C_i is defined as

$$ d_a(v, C_i) = \frac{1}{|C_i|} \sum_{v_p \in C_i} d(v, v_p) \tag{1} $$

where |C_i| denotes the size of cluster C_i.

The above notations are summarized in Table 1.


2.2  A  Validation  Index  for  Clustering 


The objective of data clustering is to find a partition with large intra-cluster
cohesion and inter-cluster separation [13]. Various validation indices for clustering
have been proposed [2]. Among them, we utilize an index called the generalized Dunn's
index, Dunn_G. Dunn_G is designed to offer a compromise between the inter-cluster
separation and the intra-cluster cohesion.

For any C_h ∈ P, an average within-cluster dissimilarity is defined as

$$ S_a(C_h) = \frac{1}{|C_h|(|C_h| - 1)} \sum_{v \in C_h} \sum_{v' \in C_h} d(v, v') \tag{2} $$

For any pair of clusters C_i, C_j ∈ P, an average between-cluster dissimilarity is
defined as

$$ d_a(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{v \in C_i} \sum_{v' \in C_j} d(v, v') \tag{3} $$

Based on the above, the generalized Dunn's index for a partition P is defined as

$$ Dunn_G(P) = \frac{\min_{i \neq j} d_a(C_i, C_j)}{\max_{h} S_a(C_h)} \tag{4} $$

where C_h, C_i, C_j ∈ P. The larger Dunn_G(P) is, the better the partition (coloring).
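The index in eq. (4) can be computed as in the following sketch. The function names and the one-dimensional toy data are illustrative assumptions. Two conventions are also our own, since the text does not address these cases: a singleton cluster is given a within-cluster dissimilarity of 0 (eq. (2) would otherwise divide by zero), and the index is reported as infinity when every cluster has zero within-cluster dissimilarity. The dissimilarity d is assumed to satisfy d(v, v) = 0.

```python
from itertools import combinations


def within(cluster, d):
    """Eq. (2): average within-cluster dissimilarity S_a(C_h)."""
    size = len(cluster)
    if size < 2:
        return 0.0  # assumption: singletons are perfectly cohesive
    total = sum(d(u, v) for u in cluster for v in cluster)
    return total / (size * (size - 1))


def between(ci, cj, d):
    """Eq. (3): average between-cluster dissimilarity d_a(C_i, C_j)."""
    return sum(d(u, v) for u in ci for v in cj) / (len(ci) * len(cj))


def generalized_dunn(partition, d):
    """Eq. (4): min between-cluster over max within-cluster dissimilarity.

    Requires at least two clusters in `partition`.
    """
    min_between = min(between(ci, cj, d)
                      for ci, cj in combinations(partition, 2))
    max_within = max(within(c, d) for c in partition)
    if max_within == 0:
        return float("inf")  # assumption: no within-cluster spread at all
    return min_between / max_within


# Hypothetical data: points on a line, dissimilarity = absolute difference.
def d(u, v):
    return abs(u - v)


partition = [[0, 1], [4, 6]]
index = generalized_dunn(partition, d)
```

For the toy partition, S_a is 1.0 and 2.0 for the two clusters and the single between-cluster average is (4 + 6 + 3 + 5) / 4 = 4.5, giving an index of 4.5 / 2.0 = 2.25.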


2.3  b-Coloring  Based  Clustering 

The  notion  of  graph  b-coloring  was  proposed  in  [12].  A  b-coloring  of  an  undirected 
graph  G  is  a  vertex  coloring  of  G  and  satisfies  the  following  two  constraints: 

(i)  adjacent  vertices  have  different  colors  (proper  coloring) 

(ii) for each color, there exists at least one vertex (called a b-dominating vertex)
which is adjacent to all the other colors.

Fig. 1. A graph (with θ = 0.15) for the data in Table 2
Fig. 2. A b-coloring of the graph in Fig. 1 (Dunn_G = 0.916)
Fig. 3. Another b-coloring for Fig. 1 (Dunn_G = 1.000)

Table 2. An example of data set with dissimilarities

By assuming that some dissimilarity measure and a threshold θ are given, [5] proposed
a clustering algorithm based on b-coloring. Each data item is mapped to a vertex, and
vertices are connected if their dissimilarity is greater than θ. Thus, the entire set
of data items is represented as a simple graph G(V, E). For example, for the data
items with dissimilarities in Table 2, Fig. 1 is the corresponding graph when θ is
set to 0.15. Fig. 2 is an example of a b-coloring of the graph in Fig. 1. This
coloring is obtained by the algorithm in [5]. The vertices with the same color (shape)
are grouped into the same cluster. Thus, {a,c,g,i}, {d,f}, {e,h}, {b} are the clusters
in Fig. 2.
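The graph construction described above, together with the reading of clusters from a coloring, can be sketched as follows. The function names and the three-item dissimilarity values are hypothetical and do not reproduce the data in Table 2; only the rule "connect two items when their dissimilarity is greater than θ" comes from the text.

```python
def threshold_graph(items, d, theta):
    """Connect two items when their dissimilarity is greater than theta."""
    graph = {v: set() for v in items}
    for i, u in enumerate(items):
        for v in items[i + 1:]:
            if d(u, v) > theta:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def clusters_from_coloring(coloring):
    """Group the vertices that share a color into one cluster."""
    clusters = {}
    for v, color in coloring.items():
        clusters.setdefault(color, set()).add(v)
    return list(clusters.values())


# Hypothetical dissimilarities between three data items.
dissimilarity = {("a", "b"): 0.10, ("a", "c"): 0.50, ("b", "c"): 0.40}


def d(u, v):
    # The dissimilarity is symmetric; look the pair up in either order.
    return dissimilarity[(u, v)] if (u, v) in dissimilarity else dissimilarity[(v, u)]


graph = threshold_graph(["a", "b", "c"], d, theta=0.15)
clusters = clusters_from_coloring({"a": 1, "b": 1, "c": 2})
```

Items a and b stay unconnected because their dissimilarity (0.10) does not exceed θ = 0.15, so they may share a color, whereas c is adjacent to both and must receive a different one.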

Note that the graph is constructed such that the pairs of vertices with dissimilarity
greater than θ are connected. Thus, adjacent vertices should be assigned to different
clusters (colors), since they are "far away" from each other. This is guaranteed by
the constraint (i). As a result, the data items within the same cluster are not
dissimilar to each other. This corresponds to sustaining intra-cluster cohesion.

On the other hand, from (ii), each cluster contains at least one b-dominating vertex,
which is adjacent to all the other clusters and thus is far (dissimilar) from them. This
corresponds to sustaining inter-cluster separation. In particular, a b-dominating vertex in
(ii) justifies the creation of the cluster containing it: since it cannot be assigned to any
of the other clusters, the cluster needs to be created to include it.

As explained in Section 2.2, finding a partition with large intra-cluster cohesion
and inter-cluster separation is important in clustering. These can be pursued in
b-coloring based clustering via the constraints: the former by (i), and the latter by (ii).
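The construction described in this section can be sketched as follows. This is only a minimal illustration, not the algorithm of [5]: the function names, the toy items and the toy dissimilarities are hypothetical (they are not the data of Table 2). The sketch builds the threshold graph and checks constraints (i) and (ii) for a given coloring.

```python
from itertools import combinations

def threshold_graph(items, dissim, theta):
    """Connect two items when their dissimilarity is greater than theta."""
    adj = {v: set() for v in items}
    for u, v in combinations(items, 2):
        if dissim[frozenset((u, v))] > theta:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_proper(adj, color):
    """Constraint (i): adjacent vertices have different colors."""
    return all(color[u] != color[v] for u in adj for v in adj[u])

def b_dominating(adj, color):
    """Vertices whose neighbours carry every color other than their own."""
    colors = set(color.values())
    return {v for v in adj
            if {color[u] for u in adj[v]} >= colors - {color[v]}}

def is_b_coloring(adj, color):
    """Constraints (i) and (ii): proper, and every color has a b-dominating vertex."""
    dominated = {color[v] for v in b_dominating(adj, color)}
    return is_proper(adj, color) and dominated == set(color.values())

# Hypothetical toy data (NOT the data of Table 2).
ITEMS = ["a", "b", "c", "d"]
DISSIM = {frozenset(p): w for p, w in {
    ("a", "b"): 0.1, ("a", "c"): 0.5, ("a", "d"): 0.6,
    ("b", "c"): 0.4, ("b", "d"): 0.7, ("c", "d"): 0.1}.items()}
ADJ = threshold_graph(ITEMS, DISSIM, theta=0.15)
```

With this toy data, the two-color assignment {a, b} / {c, d} satisfies both constraints, whereas a three-color assignment that separates a from b does not, because no vertex of a's color is adjacent to all the other colors.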

2.4  Previous  Re-coloring  Method 

Even for the same graph and the same number of clusters (colors), the graph in Fig. 1
has other partitions (b-colorings) with better quality (i.e., with larger $Dunn_G$).
An example of another b-coloring is shown in Fig. 3. The coloring in Fig. 3 ($Dunn_G$ =
1.000) is better than that in Fig. 2 ($Dunn_G$ = 0.916) w.r.t. the index in eq.(4).

Fig. 4. Result with algorithm BFReColoring ($Dunn_G$ = 1.500)
Fig. 5. Result with algorithm ExColors ($Dunn_G$ = 1.000)
Fig. 6. Result with both algorithms ($Dunn_G$ = 1.750)

In  order  to  find  better  partitions,  [6]  proposed  a  greedy  re-coloring  algorithm.  For  a 
graph  and  its  coloring,  the  colors  of  vertices  are  changed  (re-colored)  sequentially  under 
the  constraint  that  the  number  of  b-dominating  vertices  is  not  decreased.  For  instance, 
for  the  graph  and  its  coloring  in  Fig.  2,  the  greedy  algorithm  in  [6]  gives  the  b-coloring 
in  Fig.  3  with  better  quality. 

The effectiveness of the re-coloring based approach for obtaining better clusters was
demonstrated in [6]; however, the algorithm still has several limitations:

i)  greedy  procedure:  the  colors  of  re-colored  vertices  were  never  modified  again. 
Thus,  other  possibly  better  partitions  could  not  be  obtained. 

ii) vertices for re-coloring: only a small portion of the vertices were re-colored in order
to guarantee the termination of the algorithm.

iii)  inaccurate  quality  estimation  of  clusters:  not  all  vertices  were  utilized  for  quality 
estimation. 

To cope with these issues, we propose an extended re-coloring approach. As for i), we
propose a best first re-coloring algorithm (Section 3.2) to realize a non-greedy search
for a better partition. As for ii), we propose a color exchange algorithm (Section 3.3)
so that more vertices can be tested for re-coloring. As for iii), instead of a subset of
the vertices, we utilize all the vertices in a graph for estimating the quality of the partition.

For instance, for the same graph and its coloring in Fig. 2, the proposed approach
obtains the colorings in Fig. 4 ($Dunn_G$ = 1.500) and Fig. 5 ($Dunn_G$ = 1.000)
with the algorithms BFReColoring (Section 3.2) and ExColors (Section 3.3),
respectively. These are better than, or at least of the same quality as, Fig. 2
($Dunn_G$ = 0.916) and Fig. 3 ($Dunn_G$ = 1.000). Furthermore, since these algorithms
are orthogonal with respect to the re-colored vertices, an even better partition can be
obtained by utilizing both of them, as shown in Fig. 6 ($Dunn_G$ = 1.750).

3  Re-coloring Algorithms Based on Graph b-Coloring

3.1  Definitions 

For a b-coloring of a graph $G(V, E)$, the set of vertices $V_d$ consists of the
b-dominating vertices in the coloring. For each b-dominating vertex $v_d \in V_d$,
if $v_s \in N(v_d)$¹ is the only vertex with the color $c(v_s)$ in $N(v_d)$, this vertex
is called a supporting vertex of $v_d$. $V_s$ denotes the set of supporting vertices in the
coloring. The vertices in $V_c = V_d \cup V_s$ are called critical vertices. On the
other hand, the vertices in $V_{nc} = V \setminus V_c$ are called non-critical vertices².
These are summarized in Table 3.

The proposed algorithms re-color the vertices when the quality of the clusters is
improved. Note that no particular quality measure is assumed. In the following
description, $q(\cdot)$ stands for a quality measure of a partition (cf. $Dunn_G$).


Table 3. Notations for vertices

symbol     description
$V_d$      b-dominating vertex set
$V_s$      supporting vertex set
$V_c$      critical vertex set
$V_{nc}$   non-critical vertex set
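The vertex sets of Table 3 can be computed directly from their definitions. The following is a minimal sketch; the function name, the adjacency-dictionary representation and the toy graph are assumptions made for illustration only.

```python
def classify_vertices(adj, color):
    """Return (V_d, V_s, V_c, V_nc) for a coloring of the graph `adj`.

    adj   : dict mapping each vertex to the set of its adjacent vertices
    color : dict mapping each vertex to its color
    """
    colors = set(color.values())
    # V_d: vertices adjacent to every color other than their own.
    v_d = {v for v in adj
           if {color[u] for u in adj[v]} >= colors - {color[v]}}
    # V_s: for a b-dominating vertex d, a neighbour that is the only one
    # of its color in N(d) is a supporting vertex of d.
    v_s = set()
    for d in v_d:
        by_color = {}
        for u in adj[d]:
            by_color.setdefault(color[u], []).append(u)
        v_s |= {us[0] for us in by_color.values() if len(us) == 1}
    v_c = v_d | v_s
    return v_d, v_s, v_c, set(adj) - v_c

# Hypothetical toy graph: a triangle a-b-c plus two pendant vertices on a.
ADJ = {"a": {"b", "c", "d", "e"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"a"}, "e": {"a"}}
COLOR = {"a": 0, "b": 1, "c": 2, "d": 1, "e": 2}
V_D, V_S, V_C, V_NC = classify_vertices(ADJ, COLOR)
```

In this toy coloring the triangle vertices are both b-dominating and supporting, so only the two pendant vertices d and e are non-critical.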

3.2  A  Best  First  Re-coloring  Algorithm 

To realize non-greedy search, we utilize the best first search strategy, which has been
widely utilized in the AI community, and select the best coloring among the candidate
colorings of a graph $G(V, E)$. From the graph, a pre-defined number of vertices are
selected in descending order of $d_a(v, c(v))$ in eq.(1), where $v \in V$. Here,
$d_a(v, c(v))$ can be interpreted as the extent to which the vertex $v$ is an “outlier” for
its currently assigned cluster $c(v)$. Thus, we select the vertex with the largest
$d_a(v, c(v))$ so that it can be moved into another cluster via re-coloring. The selected
vertex is re-colored so as to increase the quality of the partition, as long as the
constraints of b-coloring are satisfied. The above process is repeated for the specified
number of iterations.

Only critical vertices were utilized for estimating the quality of the partition in [6].
However, this can result in unreliable quality estimation and misguide the search direction.
To alleviate this problem, we utilize all the vertices for quality estimation so that the
algorithm works as an any-time algorithm and only better partitions are returned.
Note that when a non-critical vertex is re-colored, some critical vertices
can become non-critical, and vice versa. Thus, after re-coloring a vertex, the status
of the vertices is checked and reflected in the subsequent re-coloring process.

The proposed algorithm BFReColoring is summarized in Algorithm 1. In Algorithm 1,
$b$ stands for the branching number and $l$ for the number of iterations. One re-colored
partition is obtained at line 6 by calling reColoring in Algorithm 2. In Algorithm 2,
the vertex with the largest average dissimilarity is selected at line 4. The color with the
best quality is assigned to the vertex at line 7 of Algorithm 2.

1 $N(v_d)$ returns the set of vertices adjacent to the vertex $v_d$.

2 $\setminus$ denotes set difference.

Algorithm 1. BFReColoring
Require: G(V, E)
Require: P // a b-coloring partition of G(V, E)
Require: b // the branching number
Require: l // the number of iterations
 1: 𝒫_searched ⇐ ∅; 𝒫_cand ⇐ ∅ // 𝒫 represents a set of partitions
 2: P_current ⇐ P
 3: for i ⇐ 0; i < l; i++ do
 4:    𝒫_searched ⇐ 𝒫_searched ∪ {P_current}
 5:    for j ⇐ 0; j < b; j++ do
 6:       P' ⇐ reColoring(P_current, 𝒫_searched) // call reColoring in Algorithm 2
 7:       𝒫_cand ⇐ 𝒫_cand ∪ {P'}
 8:    end for
 9:    𝒫_searched ⇐ 𝒫_searched ∪ 𝒫_cand
10:    P_current ⇐ arg max_{P' ∈ 𝒫_cand} q(P') // q() evaluates the quality of a partition
11:    𝒫_cand ⇐ 𝒫_cand \ {P_current}
12: end for
13: return arg max_{P' ∈ 𝒫_searched} q(P')


For a graph, the computation of $d_a(v_i, c(v_i))$ for all the vertices can be conducted in
$O(n^2)$ at the beginning, and it can be updated in $O(n)$ at line 4 in Algorithm 2. Denoting
the time complexity of the quality evaluation $q(\cdot)$ as $\rho$³, line 7 takes at most
$O(\Delta\rho)$ since $|C_p(v^*)| \le \Delta + 1$. Algorithm 2 can take
$O(n(n + \Delta^2\rho))$ in the worst case, when both while loops at lines 3 and 6 are
exhaustively iterated. However, this is a rather pessimistic estimate, since these while
loops only serve to avoid duplicated partitions. In most cases the most expensive step
(line 7) is called only once in Algorithm 2. Thus, the complexity of Algorithm 1 can be
considered as $O(bl\Delta\rho)$⁴.
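The interplay of Algorithms 1 and 2 can be sketched in a few dozen lines. This is a simplified reading and not the authors' implementation: all names, the toy graph and the quality function `cohesion` (negative total intra-cluster dissimilarity) are hypothetical. The sketch also assumes that the candidates collected in one iteration are added to the set passed to the re-coloring step, so that repeated calls yield distinct partitions.

```python
def dominating(adj, color):
    """b-dominating vertices: their neighbours show every other color."""
    colors = set(color.values())
    return {v for v in adj if {color[u] for u in adj[v]} >= colors - {color[v]}}

def is_b_coloring(adj, color):
    proper = all(color[u] != color[v] for u in adj for v in adj[u])
    return proper and {color[v] for v in dominating(adj, color)} == set(color.values())

def non_critical(adj, color):
    v_d = dominating(adj, color)
    v_s = {u for d in v_d for u in adj[d]
           if sum(color[w] == color[u] for w in adj[d]) == 1}
    return set(adj) - v_d - v_s

def avg_dissim(v, color, dissim):
    """Average dissimilarity of v to the other members of its own cluster."""
    mates = [u for u in color if u != v and color[u] == color[v]]
    return sum(dissim[frozenset((u, v))] for u in mates) / len(mates) if mates else 0.0

def key(color):
    return tuple(sorted(color.items()))  # hashable form of a partition

def re_coloring(adj, dissim, current, seen, q):
    """Re-color one non-critical vertex; return a partition not in `seen`, or None."""
    order = sorted(non_critical(adj, current),
                   key=lambda v: avg_dissim(v, current, dissim), reverse=True)
    for v in order:
        trials = [{**current, v: c} for c in set(current.values()) - {current[v]}]
        trials = [t for t in trials if is_b_coloring(adj, t)]
        for t in sorted(trials, key=q, reverse=True):
            if key(t) not in seen:
                return t
    return None

def bf_recoloring(adj, dissim, start, q, b=3, iters=5):
    searched, cand, current = {}, {}, start
    for _ in range(iters):
        searched[key(current)] = current
        for _ in range(b):
            new = re_coloring(adj, dissim, current, set(searched) | set(cand), q)
            if new is None:
                break
            cand[key(new)] = new
        searched.update(cand)
        if not cand:
            break
        current = max(cand.values(), key=q)  # best first: expand the best candidate
        del cand[key(current)]
    return max(searched.values(), key=q)

# Hypothetical toy instance (triangle a-b-c, pendant vertices d and e on a).
ADJ = {"a": {"b", "c", "d", "e"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"a"}, "e": {"a"}}
DISSIM = {frozenset(p): w for p, w in {
    ("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.8, ("a", "d"): 0.7,
    ("a", "e"): 0.7, ("b", "d"): 0.4, ("c", "e"): 0.4, ("c", "d"): 0.1,
    ("b", "e"): 0.1, ("d", "e"): 0.3}.items()}
START = {"a": 0, "b": 1, "c": 2, "d": 1, "e": 2}

def cohesion(color):
    """Hypothetical quality q: negative sum of intra-cluster dissimilarities."""
    return -sum(w for pair, w in DISSIM.items()
                if len({color[v] for v in pair}) == 1)

BEST = bf_recoloring(ADJ, DISSIM, START, cohesion)
```

In this toy instance no single re-coloring of d or e improves the quality, so a greedy procedure would stop at the start coloring; the best first search keeps the equally good candidates and reaches the better partition {a}, {b, e}, {c, d} after a second re-coloring step.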

3.3  A  Color  Exchange  Algorithm 

If the color of a critical vertex is changed, the number of b-dominating vertices will
decrease. Since b-dominating vertices are considered useful for sustaining inter-cluster
separation, re-coloring was conducted only on non-critical vertices in [6].

Although it is difficult to re-color critical vertices sequentially without decreasing the
number of b-dominating vertices, this problem can be resolved if more than one vertex
is re-colored simultaneously. As a first step, we propose a color exchange algorithm
for critical vertices. We define two adjacent critical vertices to be color exchangeable
if the following three conditions are satisfied.

Definition 1 (Color Exchangeable). For a graph and its partition (coloring) $P$, let $P'$
be the coloring obtained by exchanging the colors of two adjacent critical vertices. If $P'$
satisfies the following conditions, these vertices are called color exchangeable:

1. all the adjacent vertices have different colors,
2. the number of b-dominating vertices is not decreased,
3. the number of colors is not decreased.

3 $Dunn_G$ can be calculated in $O(n^2)$ and updated in $O(n)$ for re-coloring of a vertex.

4 Admittedly, $O(bl(n(n + \Delta^2\rho)))$ in the worst case in standard notation.

Algorithm 2. reColoring
Require: P_current // a b-coloring partition of G(V, E)
Require: 𝒫_searched // a set of searched partitions
 1: P' ⇐ P_current // copy the coloring (partition)
 2: V' ⇐ V_nc // candidate vertices in P'
 3: while V' ≠ ∅ do
 4:    v* := arg max_{v ∈ V'} d_a(v, c(v))
 5:    C' ⇐ ∅ // C' stores the tested colors of v*
 6:    while C_p(v*) \ C' ≠ ∅ do
 7:       c*(v*) ⇐ arg max_{c ∈ C_p(v*) \ C'} q(P(v*, c)) // P(v*, c) is a partition with color c for v*
 8:       C' ⇐ C' ∪ {c*(v*)} // add to the already tested colors
 9:       re-color c(v*) to c*(v*) in P'
10:       if P' ∉ 𝒫_searched then
11:          return P' // return the re-colored partition
12:       end if
13:       re-color c*(v*) back to the original c(v*) in P' // P' was already searched
14:    end while
15:    V' ⇐ V' \ {v*}
16: end while
17: return ∅

Currently, the candidate vertices for color exchange are: 1) a b-dominating vertex $v_i$ and
its supporting vertex $v_j$, or 2) for some b-dominating vertex, its two supporting vertices
$v_j$ and $v_k$. For a b-dominating vertex $v_i$, a supporting vertex $v_j$ is the only vertex
with color $c(v_j)$ in $N(v_i)$. Thus, if the color $c(v_i)$ is different from the neighboring
colors of $v_j$, exchanging their colors does not decrease the number of b-dominating
vertices. Similarly, for a b-dominating vertex, exchanging the colors of its two supporting
vertices does not decrease the number of b-dominating vertices.
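The three conditions of Definition 1 can be checked mechanically after a trial exchange. The sketch below is illustrative only; the function names and the toy graph are hypothetical, and the check is written for clarity rather than for the complexity bounds discussed in this section.

```python
def dominating(adj, color):
    """b-dominating vertices: their neighbours show every other color."""
    colors = set(color.values())
    return {v for v in adj if {color[u] for u in adj[v]} >= colors - {color[v]}}

def exchangeable(adj, color, u, v):
    """Check Definition 1 for the two adjacent vertices u and v."""
    swapped = dict(color)
    swapped[u], swapped[v] = color[v], color[u]
    # 1. all adjacent vertices have different colors
    proper = all(swapped[x] != swapped[y] for x in adj for y in adj[x])
    # 2. the number of b-dominating vertices is not decreased
    keeps_dominating = len(dominating(adj, swapped)) >= len(dominating(adj, color))
    # 3. the number of colors is not decreased
    keeps_colors = len(set(swapped.values())) >= len(set(color.values()))
    return proper and keeps_dominating and keeps_colors

# Hypothetical toy graph: a triangle a-b-c plus two pendant vertices on a.
ADJ = {"a": {"b", "c", "d", "e"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"a"}, "e": {"a"}}
COLOR = {"a": 0, "b": 1, "c": 2, "d": 1, "e": 2}
```

Here b is b-dominating and c is its supporting vertex, and their colors can be exchanged. Exchanging a and b is rejected, because a would then share a color with its neighbour d.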

The proposed algorithm ExColors is summarized in Algorithm 3. For the selected
vertex, the candidate vertices for color exchange are enumerated at line 5 using
ExVertices in Algorithm 4. Color exchange is conducted if a) the pair of critical vertices
is color exchangeable, and b) the quality of the partition would be improved (line 9). If
there is more than one candidate vertex for exchange, the vertex with the maximum average
dissimilarity is selected.

As in Algorithm 2, the selection of a vertex can be conducted in $O(n)$ at line 3 in
Algorithm 3. Since up to two-step neighboring vertices of the selected vertex are checked in
Algorithm 4, at most $O(\Delta^2)$ vertices are obtained as the candidates. Thus, the overall
time complexity of Algorithm 3 is $O(n(n + \Delta^2\rho))$⁵.

5 As in Section 3.2, the time complexity of $q(\cdot)$ is denoted as $\rho$.

Algorithm 3. ExColors
Require: G(V, E) // a graph with a set of vertices and a set of edges
Require: P // a partition which is a b-coloring of G(V, E)
 1: V' ⇐ V_c
 2: while V' ≠ ∅ do
 3:    v_i := arg max_{v_i ∈ V'} d_a(v_i, c(v_i))
 4:    V' ⇐ V' \ {v_i}
 5:    V'_ex := ExVertices(G, P, v_i) // call ExVertices in Algorithm 4
 6:    while V'_ex ≠ ∅ do
 7:       v_j := arg max_{v_j ∈ V'_ex} d_a(v_j, c(v_j))
 8:       V'_ex ⇐ V'_ex \ {v_j}
 9:       if (c(v_i) and c(v_j) are exchangeable) ∧ (q(P) < q(P(exchange(c(v_i), c(v_j))))) then
10:          exchange colors c(v_i) and c(v_j) in P
             // the colors of v_i and v_j are exchangeable and the quality would be improved
11:          V' ⇐ V_c; break
12:       end if
13:    end while
14: end while
15: return P


3.4  Working  Examples 

As shown in Section 2.4, for the graph and its coloring in Fig. 2, the colorings in Fig. 4
($Dunn_G$ = 1.500) and Fig. 5 ($Dunn_G$ = 1.000) are obtained by BFReColoring and
ExColors, respectively. Non-critical vertices a, e, f, i were re-colored in Fig. 4. The
colors of critical vertices b and h were exchanged in Fig. 5.

Furthermore, critical vertices are considered for color exchange in ExColors; on the
other hand, non-critical vertices are re-colored in BFReColoring. Since
$V_c \cap V_{nc} = \emptyset$, these algorithms are mutually independent and orthogonal with
respect to the re-colored vertices. Thus, they can be utilized in conjunction. The coloring
in Fig. 6 (with $Dunn_G$ = 1.750) is obtained by applying both algorithms. In this example,
critical vertices b and h and a non-critical vertex f were re-colored. Thus, the proposed
approach makes it possible to obtain better partitions (colorings) by enlarging the search
space via non-greedy search and re-coloring of critical vertices.

4  Preliminary  Evaluations 

4.1  Evaluation  Measures 

In addition to $Dunn_G$ in eq.(4), we also evaluated a) the micro-averaged Precision and
b) the distinctness of a partition. As with $Dunn_G$, the larger the evaluated value is, the
better the partition is.

Algorithm 4. ExVertices
Require: G(V, E) // a graph with a set of vertices and a set of edges
Require: P // a partition which is a b-coloring of G(V, E)
Require: v_i // a vertex
 1: if v_i ∈ V_d then
 2:    V'_ex := supporting vertices of v_i
 3: else if v_i ∈ V_s then
 4:    for each v_n ∈ N(v_i) do
 5:       if (v_n ∈ V_d) ∧ (v_i is a supporting vertex of v_n) then
 6:          V'_ex := {v_n} ∪ supporting vertices of v_n
 7:       end if
 8:    end for
 9: end if
10: return V'_ex

Micro-Averaged Precision. Micro-averaged precision is a widely utilized measure
in the information retrieval community [1]. Based on the cross table of true clusters and
assigned clusters, it is calculated by averaging the precision of data assignment to each
constructed cluster. Please refer to [1] for the details. We call this Precision hereafter.
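One common reading of this measure is sketched below. It is an assumption on our part, since the exact definition is deferred to [1]: each constructed cluster is matched with its majority true class, and the matched counts are pooled over all clusters. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def micro_precision(assigned, truth):
    """Pooled fraction of items whose true class is the majority class of
    the cluster they were assigned to (assumed reading of the measure)."""
    table = Counter(zip(assigned, truth))  # cross table: (cluster, class) -> count
    best = Counter()
    for (cluster, _), count in table.items():
        best[cluster] = max(best[cluster], count)
    return sum(best.values()) / len(assigned)

# Hypothetical toy labels: cluster 0 holds x, x, y and cluster 1 holds y, y.
ASSIGNED = [0, 0, 0, 1, 1]
TRUTH = ["x", "x", "y", "y", "y"]
PRECISION = micro_precision(ASSIGNED, TRUTH)
```

In the toy example four of the five items agree with the majority class of their cluster.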

Distinctness. The variance of the distribution match between clusters $C_h$ and $C_l$ in a
partition is defined as:

$$Var(C_h, C_l) = \frac{1}{p} \sum_{i=1}^{p} \sum_{j} \left( P(a_i = x_{ij} \mid C_h) - P(a_i = x_{ij} \mid C_l) \right)^2 \qquad (5)$$

where $p$ is the number of attributes. $P(a_i = x_{ij} \mid C_l)$ represents the conditional
probability of attribute $a_i$ taking the value $x_{ij}$ in cluster $C_l$.

The distinctness of a partition $P$ is defined as the average variance [14]:

$$Dist(P) = \frac{\sum_{h=1}^{k} \sum_{l=h+1}^{k} Var(C_h, C_l)}{k(k-1)/2} \qquad (6)$$

where $k$ is the number of clusters in $P$.
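A minimal sketch of the distinctness computation follows, assuming discrete attribute values and reading the "average variance" as the mean of $Var(C_h, C_l)$ over all unordered pairs of clusters. The function names and the toy clusters are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cond_probs(rows):
    """For each attribute i, the distribution P(a_i = x | C) within one cluster."""
    n = len(rows)
    return [{x: c / n for x, c in Counter(col).items()} for col in zip(*rows)]

def variance(rows_h, rows_l):
    """Eq.(5): mean over attributes of squared differences of the value probabilities."""
    probs_h, probs_l = cond_probs(rows_h), cond_probs(rows_l)
    total = 0.0
    for dist_h, dist_l in zip(probs_h, probs_l):
        for x in set(dist_h) | set(dist_l):
            total += (dist_h.get(x, 0.0) - dist_l.get(x, 0.0)) ** 2
    return total / len(probs_h)

def distinctness(clusters):
    """Average of variance() over all unordered pairs of clusters."""
    pairs = list(combinations(clusters, 2))
    return sum(variance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy clusters with two discrete attributes each.
C1 = [("red", "small"), ("red", "small")]
C2 = [("green", "small"), ("green", "large")]
DIST = distinctness([C1, C2])
```

For the toy clusters the first attribute contributes 2.0 and the second 0.5, so the variance (and, with a single pair, the distinctness) is 1.25.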


4.2  Experimental  Settings 

Preliminary evaluations were conducted over several UCI datasets [11]. The utilized
datasets were: Zoo (101 data, 7 labels), Teaching Assistant Evaluation (tae) (151 data,
3 labels), and Protein Localization Sites (ecoli) (336 data, 8 labels). In all the datasets,
each data item has its true class label. The true class labels are regarded as “ground
truth” and utilized to calculate Precision. After normalizing each attribute to [0,1] (as
in Weka [17]), dissimilarities between data items were calculated using the standard
Euclidean distance.
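The preprocessing described above can be sketched as follows. This is a plain min-max normalization written for illustration; it is not Weka's code, and the function names and the toy rows are hypothetical.

```python
import math

def min_max_normalize(rows):
    """Rescale every attribute (column) to the interval [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero range; dividing by 1.0 maps it to 0.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(x - lo) / sp for x, lo, sp in zip(row, lows, spans)] for row in rows]

def euclidean(u, v):
    """Standard Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy data with two numeric attributes on different scales.
ROWS = [[0, 10], [5, 20], [10, 30]]
NORMALIZED = min_max_normalize(ROWS)
```

After normalization both attributes contribute on the same scale, so the distance between the first and the last toy row is the square root of 2.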

The proposed algorithms (with BFReColoring, with ExColors, with both of them)
were compared with the following clustering algorithms: 1) the previous re-coloring
algorithm [6], 2) the kmeans algorithm [10], and 3) the EM algorithm [3]. Weka [17] was
used for kmeans and EM. Since kmeans and EM require the number of clusters, the true
number of clusters was specified for each dataset.

Table 4. Result (zoo dataset)

          Prec    Dist    $Dunn_G$
BF        0.812   0.479   1.050
Ex        0.743   0.577   0.796
Ex+BF     0.812   0.479   1.050
greedy    0.733   0.401   0.910
kmeans    0.721   0.489   1.014
EM        0.673   0.591   0.981

Table 5. Result (tae dataset)

          Prec    Dist    $Dunn_G$
BF        0.444   0.363   0.983
Ex        0.430   0.355   0.923
Ex+BF     0.444   0.363   0.983
greedy    0.430   0.530   1.350
kmeans    0.517   0.458   1.132
EM        0.404   0.321   1.141

Table 6. Result (ecoli dataset)

          Prec    Dist    $Dunn_G$
BF        0.631   0.444   0.831
Ex        0.591   0.449   0.653
Ex+BF     0.631   0.444   0.831
greedy    0.324   0.419   1.004
kmeans    0.613   0.153   0.609
EM        0.619   0.168   0.604

Following the experimental setting in [6], the same graph and its partition (coloring
of the graph) were given to the proposed algorithms and to algorithm 1) (the previous
re-coloring algorithm [6]), and $Dunn_G(\cdot)$ was used as the quality measure $q(\cdot)$
in the algorithms. In Algorithm 1, $b$ was set to 10 and $l$ was set to 103. The threshold
for defining the graph structure was set so that the same number of colors (clusters) was
obtained in each dataset.

4.3  Results 

The results are summarized in Tables 4, 5, and 6. In the tables, BF stands for
BFReColoring (Algorithm 1), Ex stands for ExColors (Algorithm 3), Ex+BF stands for applying
ExColors and BFReColoring in this order, and greedy stands for the algorithm in [6].

The results show that the proposed algorithms outperform the other algorithms in
most cases w.r.t. Precision. Since the evaluation based on the true class labels is
considered the so-called “ground truth” evaluation, the results indicate that the proposed
approach is promising for improving re-coloring based clustering.

Intuitively, $Dist(\cdot)$ in eq.(6) evaluates to what extent the obtained clusters differ
w.r.t. the prediction of the attribute values. The results varied depending on the datasets,
and it is difficult to draw a decisive conclusion w.r.t. distinctness from the results.

$Dunn_G(\cdot)$ in eq.(4) was used as the quality measure $q(\cdot)$ in our algorithms and
greedy. These algorithms improved this quality, and the values were larger than those
obtained by kmeans and EM (except for the tae dataset). However, in Table 6, greedy
returned the largest $Dunn_G(\cdot)$ value, but it is actually the worst w.r.t. Precision.

4.4  Discussion 

The results in Section 4.3 indicate that the proposed approach is effective for improving
the performance of re-coloring based clustering in terms of Precision. Unfortunately,
Precision cannot be utilized directly as the quality measure $q(\cdot)$ in any algorithm,
since it is calculated based on the “true” labels. Note that “true” labels are unavailable
for clustering, or in unsupervised learning in general. On the other hand, eq.(4) can
be calculated from the available data alone and thus can be utilized. It is not yet clear
how the latter correlates with Precision. In addition, the performance of BF and Ex+BF
was the same for these datasets. Much more work needs to be conducted to investigate
the usage of dissimilarity information in the algorithm.

5  Conclusion

This paper has proposed an approach toward improving re-coloring based clustering
with graph b-coloring. Based on the notion of b-coloring in graph theory [12], clustering
algorithms were proposed in previous approaches; however, these were still restrictive
in terms of the explored search space due to their greedy and sequential re-coloring
process. In this paper, a best first re-coloring algorithm was proposed to realize
non-greedy search for the admissible colors of vertices. A color exchange algorithm was
proposed to remedy the problem in sequential re-coloring. Both algorithms enlarge the search
space and re-color the vertices of a graph to improve the quality of clusters, while
guaranteeing the property of b-coloring. In addition, these algorithms are orthogonal with
respect to the re-colored vertices and thus can be utilized in conjunction.

Preliminary evaluations were conducted over several UCI datasets. The results are
encouraging for pursuing this line of research, especially for obtaining better clusters
with respect to the ground truth micro-averaged precision. However, with respect to
other cluster validity indices, the proposed approach was merely comparable to other
approaches and could not always outperform them. We plan to conduct more evaluations and
investigate suitable quality measures to guide the search process in the proposed algorithms.

Abstract. This paper presents a methodology for expert-guided analysis
of large data sets, including large text corpora. Its main ingredient is
the algorithm for semi-supervised data clustering using cluster size
constraints, which implements several improvements over existing k-means
constrained clustering algorithms. First, it allows for a larger set of
user-defined cluster size constraints of different types (lower- and upper-bound
constraints). Second, it allows for dynamic re-assignment of predefined
constraints to clusters in iterative cluster computation optimization, thus
improving the results of constrained clustering. Third, it allows for
expert-guided cluster optimization achieved by combining constrained clustering
and data visualization, which enables finer-grained expert control over
the clustering process, leading to further improvements of the quality of
obtained clustering solutions. Incorporating data visualization into the
clustering process allows the user to select referential points which act as
constraint anchors in the course of iterative cluster computation. The
proposed semi-supervised constrained clustering methodology has been
implemented using a service-oriented data mining environment Orange4WS
and evaluated on different document corpora.


1  Introduction 

Clustering is a method of unsupervised learning, aimed at assigning a set of
data instances into subsets called clusters so that instances in the same cluster
are similar according to a predefined similarity measure. K-means clustering [8]
has proven to be an effective tool both in data and text mining. In text mining,
k-means clustering is being used extensively for exploratory text analysis, including
concept identification [9] and document corpora visualization [14]. Although
widely used because of its speed and simplicity, the k-means clustering algorithm
and its variants have some serious drawbacks which limit their use in specific
scenarios.

The most popular version of k-means, i.e. Forgy's algorithm [1,8], is known
to produce unbalanced and/or empty clusters when applied to datasets with a
high number of dimensions and a large number of clusters [3,8]. For example,
clustering of Web browsing data [3] with 300 dimensions (features) resulted on
average in 4.1 and 12.1 empty clusters where k was set to 50 and 100, respectively.
More generally, this phenomenon was observed when clustering data with the
number of dimensions n > 10, where the number of desired clusters was set to
k > 20 [3]. The problem of empty and unbalanced clusters has been addressed
in the area of constrained clustering, briefly introduced below.


1.1  Constrained  Clustering 

Constrained clustering is a class of semi-supervised learning algorithms which
can be divided into two main groups. Clustering algorithms with instance-based
constraints typically incorporate a set of must-link constraints and/or cannot-link
constraints [19]. Clustering algorithms with cluster-based constraints [3,18],
on the other hand, incorporate constraints concerning the size or shape of
individual clusters. In order to address the problem of empty clusters mentioned
above, Bradley, Bennett, and Demiriz [3] proposed a constrained clustering
algorithm, explicitly adding k constraints to the underlying optimization problem
which state that each cluster $h$ should contain at least $\tau_h$ points. By integrating
these constraints into the optimization procedure, they present a clear,
mathematically well-formed solution which can also be generalized to other constraints
(e.g. outlier removal or specific groupings).

In this paper, we present a method for semi-supervised constrained data
clustering using cluster size constraints, upgrading the k-means clustering method.
To do so, we first briefly present the k-means algorithm with additional
constraints. For the sake of clarity, the same notation as in [3] is used throughout
this introductory section.

Let $\mathcal{D} = \{x^i,\ i = 1, \ldots, m\}$ be a dataset in $R^n$ and $k$ the desired number
of clusters. Then, the problem of k-means clustering is to find cluster centers
$C^1, C^2, \ldots, C^k$ such that the sum of the squared error¹ (SSE) is minimized [17].
More formally, this can be written as:

$$\min_{C^1, \ldots, C^k} \sum_{i=1}^{m} \min_{h = 1, \ldots, k} \operatorname{dist}(x^i, C^h) \qquad (1)$$


This equation, however, can be reformulated into an equivalent form where binary
selector variables $T_{i,h}$ are introduced. These variables indicate the membership
of data points to clusters: $T_{i,h} = 1$ if data point $x^i$ is closest to center
$C^h$ and zero otherwise. The reformulated Eq. 1 is then as follows [3]:

$$\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{m} \sum_{h=1}^{k} T_{i,h} \cdot \operatorname{dist}(x^i, C^h) \\
\text{where}\quad & \sum_{h=1}^{k} T_{i,h} = 1, \quad i = 1, \ldots, m \\
& T_{i,h} \ge 0, \quad i = 1, \ldots, m;\ h = 1, \ldots, k
\end{aligned} \qquad (2)$$
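The two formulations can be compared numerically on a small example. The sketch below assumes that dist is the squared Euclidean distance (consistent with the SSE objective, but not stated explicitly above); the function names and the toy points are hypothetical.

```python
def sq_dist(x, c):
    """Squared Euclidean distance (assumed form of dist)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def sse_min_form(points, centers):
    """Eq. 1: every point contributes the distance to its closest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)

def closest_selectors(points, centers):
    """Binary selectors: T[i][h] = 1 if center h is closest to point i, else 0."""
    selectors = []
    for x in points:
        dists = [sq_dist(x, c) for c in centers]
        h_best = dists.index(min(dists))
        selectors.append([1 if h == h_best else 0 for h in range(len(centers))])
    return selectors

def sse_selector_form(points, centers, selectors):
    """Objective of Eq. 2 for a given selector matrix."""
    return sum(selectors[i][h] * sq_dist(x, c)
               for i, x in enumerate(points)
               for h, c in enumerate(centers))

# Hypothetical toy data: three points on a line and two centers.
POINTS = [(0, 0), (1, 0), (10, 0)]
CENTERS = [(0, 0), (10, 0)]
T = closest_selectors(POINTS, CENTERS)
```

For fixed centers, choosing the selectors by closest center makes the objective of Eq. 2 coincide with that of Eq. 1, which is the intuition behind the equivalence.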


The proof that the new equation with selector variables is equivalent to the
original can be found in [4] as Lemma 2.1. Note that it is possible for the
k-means algorithm to produce one or more empty clusters, i.e. $\sum_{i=1}^{m} T_{i,h} = 0$
for some cluster $h$, as such a solution satisfies the Karush-Kuhn-Tucker (KKT)
conditions [17] for Eq. 2. The constrained k-means algorithm can now be formalized by
adding the following cluster size constraints to Eq. 2:

$$\sum_{i=1}^{m} T_{i,h} \ge \tau_h, \quad h = 1, \ldots, k \qquad (3)$$

The values $\tau_h$ are constants specified in advance by the user. In
plain terms, each cluster $h$ must contain at least $\tau_h$ data points. To ensure
that the constructed optimization problem is solvable, we add a sanity condition
$\sum_{h=1}^{k} \tau_h \le m$, which states that the sum of all size constraints is not larger
than the size of the observed set of data instances.

1 SSE is also known as scatter.

Finally, the constrained k-means algorithm is, like the classic k-means, defined
as an iterative two-step procedure which alternates between solving the linear
program defined by Eq. 2 and 3 to obtain values for the selector variables $T_{i,h}$
(cluster assignment step) and updating the cluster centers $C^h$ (cluster update
step). As a last remark on the constrained k-means algorithm, the following
statements were proven to be true [3]:

1. The constrained k-means algorithm terminates in a finite number of iterations
at a locally optimal cluster assignment.

2. The cluster assignment sub-problem (step 1 of each iteration) is equivalent
to the Minimum Cost Flow (MCF) network optimization problem.

3. According to statement (2) above and [2], the optimal flow of the equivalent
MCF problem is integer-valued, which means that the optimal binary values
for $T_{i,h}$ can be obtained without explicitly declaring them as integers, i.e.,
without solving an integer linear programming (ILP) problem, which is known
to belong to the NP-hard class of problems [11].
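To illustrate statement 2, the cluster assignment step with lower-bound size constraints can be solved on a small flow network. The sketch below is not the formulation of [3]: it uses a plain successive-shortest-path min-cost-flow routine, and it enforces the lower bounds $\tau_h$ through a large reward on the first $\tau_h$ units entering each cluster, which is an implementation choice made here. All names and the toy cost matrix are hypothetical, and non-negative costs are assumed.

```python
from collections import deque

def constrained_assignment(cost, tau):
    """Assign each of m points to one of k clusters with at least tau[h] points
    in cluster h, minimizing the total cost (cost[i][h] >= 0 assumed)."""
    m, k = len(cost), len(cost[0])
    assert sum(tau) <= m, "sanity condition: the lower bounds must fit into m"
    big = 1 + sum(max(row) for row in cost)   # reward dominating any assignment cost
    src, snk = m + k, m + k + 1
    graph = [[] for _ in range(m + k + 2)]    # node -> list of edge indices
    to, cap, wt = [], [], []                  # edge e and e ^ 1 are mutual reverses

    def add_edge(u, v, c, w):
        graph[u].append(len(to)); to.append(v); cap.append(c); wt.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); wt.append(-w)

    for i in range(m):
        add_edge(src, i, 1, 0)                # every point carries one unit of flow
        for h in range(k):
            add_edge(i, m + h, 1, cost[i][h])
    for h in range(k):
        add_edge(m + h, snk, tau[h], -big)    # the first tau[h] units are rewarded
        add_edge(m + h, snk, m, 0)            # any further units are free

    for _ in range(m):                        # one augmentation per point
        dist = [float("inf")] * len(graph)
        prev = [-1] * len(graph)
        dist[src] = 0
        queue, queued = deque([src]), [False] * len(graph)
        while queue:                          # Bellman-Ford with a queue
            u = queue.popleft()
            queued[u] = False
            for e in graph[u]:
                if cap[e] > 0 and dist[u] + wt[e] < dist[to[e]]:
                    dist[to[e]] = dist[u] + wt[e]
                    prev[to[e]] = e
                    if not queued[to[e]]:
                        queue.append(to[e])
                        queued[to[e]] = True
        v = snk
        while v != src:                       # push one unit along the shortest path
            e = prev[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    # A saturated forward edge point -> cluster marks the assignment.
    return [next(to[e] - m for e in graph[i] if e % 2 == 0 and cap[e] == 0)
            for i in range(m)]

# Hypothetical toy costs: squared distances of the points 0, 1, 2, 10 to centers 0 and 10.
COST = [[0, 100], [1, 81], [4, 64], [100, 0]]
```

Without size constraints, the three points near 0 share the first cluster; requiring at least two points in the second cluster moves the cheapest point (the one at 2) across. The flow is integral by construction, so no integrality constraints are needed, in line with statement 3.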

1.2  Summary  of  Research  Advances  and  Paper  Outline 

The main contribution of this paper is a methodology for expert-guided analysis
of large data sets, including large text corpora. Its main ingredient is the
algorithm for semi-supervised data clustering using cluster size constraints, which
successfully eliminates some limitations of existing k-means constrained
clustering algorithms. First, it allows for a larger set of user-defined cluster size
constraints of different types (lower- and upper-bound constraints). Second, it
allows for dynamic re-assignment of predefined constraints to clusters in iterative
cluster computation optimization. Third, it allows for expert-guided cluster
optimization. The proposed semi-supervised constrained clustering algorithm is
presented in Section 2. Expert guidance is achieved by combining constrained
clustering and data visualization, which enables finer-grained expert control
over the clustering process, leading to further improvements of the quality of
obtained clustering solutions. Incorporating data visualization into the clustering
process, as described in Section 3, allows the user to explore the data and
to select referential points representing initial cluster centroids, which (in the
simplest scenario) also act as constraint anchors in the course of iterative cluster
computation. The proposed semi-supervised constrained clustering methodology
has been implemented using a service-oriented data mining environment
Orange4WS [15]. The evaluation of the methodology on various text corpora
is presented in Section 4. The paper concludes with a summary and plans for
further work.

2  Semi-supervised  k-Means  with  Cluster  Size  Constraints 

The constrained variant of the k-means algorithm, presented in Section 1, is not
appropriate if a certain cluster size constraint needs to be assigned to a cluster
with specific semantics (e.g. a cluster containing documents discussing a certain
topic). During the clustering process, cluster centers “travel” in the direction
opposite to the gradient of the target function in the observed space, which
means that each specified constraint $\tau_h$ will apply to an unknown part of the
input data space. The only applicable scenario (which was also addressed by the
authors of [3]) is the case with balanced constraints, where all $\tau_h$ are equal.
In this special case one is not concerned with the size of an individual cluster;
the objective is just to eliminate empty or very small clusters.

Therefore, we propose a modified algorithm (Algorithm 1) which is able to
overcome the indicated problem. To this end, our variant of the algorithm
maintains points of reference with respect to the given constraints and modifies the
optimization problem specifications accordingly. It should be noted, however,
that the new variant requires a certain amount of domain knowledge (the user’s
background knowledge) in order to be applied successfully. In the context of
clustering document corpora, which is the target domain of this paper, such
knowledge can be provided by visualization, as shown in Section 3 below.

The idea of the proposed algorithm is the following. In order to apply
constraints to specific parts of the data space, there must be the same number
of reference points, one for each constraint. Each such reference point
characterizes the part of the input data space where the constraint should be enforced.
However, as cluster centroids tend to travel through the data space during the
clustering process, the constraints are likely to be applied to a completely
different part of the space than the initial data subspace. For this reason, our
algorithm recomputes distances between reference points and the current
centroids in each iteration and reassigns constraints when necessary. The proposed
modification of the constrained k-means algorithm is presented as Algorithm 1.

Note, however, that the reassignment of constraints modifies the underlying
optimization problem, which introduces the possibility of cycling, where clusters
exchange constraints without their centroids converging to final positions
(statement 1 from Section 1, that the algorithm terminates in a finite number
of steps, no longer holds). Although such situations are very unlikely to occur
in high-dimensional data spaces with good initial centroids, a solution in these
rare cases is to employ simulated annealing with a simple cooling schedule or to
introduce small random jitter of centroids. As already stated, a set of reference
points $\mathcal{P}$, representing both initial cluster centroids and reference points
for constraints, is required as input to the algorithm. In order to specify these
points, the user needs to have an understanding of the underlying data. To provide
it, we employ a feature space visualization algorithm based on least-squares
meshes [16,14], described in Section 3 below. Through data visualization, the user
is able to anchor the cluster size constraints to specific parts of interest in the
data space.

Algorithm 1

Input:

- data set in $\mathbb{R}^n$ with $m$ instances: $\mathcal{D} = \{x^i,\ i = 1, \ldots, m\}$
- desired number of clusters: $k$
- set of constraints: $\tau = \{\tau_1, \ldots, \tau_k\}$
- set of reference points in $\mathbb{R}^n$: $\mathcal{P} = \{p^i,\ i = 1, \ldots, k\}$, which is also the set of
  initial centroids: $\mathcal{C}^0 = \{C^{i,0},\ i = 1, \ldots, k\}$

Output of iteration $t$:

- assignment of input data instances to clusters with respect to given constraints
- set of centroids: $\mathcal{C}^t = \{C^{i,t},\ i = 1, \ldots, k\}$

Each iteration $t$ of the algorithm consists of three steps:

1. Cluster assignment.
   Solve a linear program to obtain the values of selector variables $T^t_{i,h}$:

   $$\text{minimize} \quad \sum_{i=1}^{m} \sum_{h=1}^{k} T^t_{i,h} \cdot \mathrm{dist}(x^i, C^{h,t})$$

   where

   $$\sum_{h=1}^{k} T^t_{i,h} = 1, \quad i = 1, \ldots, m$$
   $$T^t_{i,h} \geq 0, \quad i = 1, \ldots, m;\ h = 1, \ldots, k$$
   $$\sum_{i=1}^{m} T^t_{i,h} \geq \tau_h, \quad h = 1, \ldots, k$$

2. Cluster update.
   Compute new centroids for the next iteration $t+1$:

   $$C^{h,t+1} = \begin{cases} \dfrac{\sum_{i=1}^{m} T^t_{i,h}\, x^i}{\sum_{i=1}^{m} T^t_{i,h}} & \text{if } \sum_{i=1}^{m} T^t_{i,h} > 0 \\ C^{h,t} & \text{otherwise} \end{cases}$$

3. Permutation of constraints.
   Assign newly computed centroids to reference points $p^i$ by computing
   binary selector variables $V^{t+1}_{i,h}$ so that the total distance is minimized, and
   permute the assignment of constraints to clusters accordingly:

   $$\text{minimize} \quad \sum_{i=1}^{k} \sum_{h=1}^{k} V^{t+1}_{i,h} \cdot \mathrm{dist}(p^i, C^{h,t+1})$$

   The constraint anchored at reference point $p^i$ is applied to cluster $h$ if
   $V^{t+1}_{i,h} = 1$.
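
The cluster update and constraint permutation steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in this paper: the function names are hypothetical, the cluster assignment is passed in as a list of cluster indices (one per data instance), Euclidean distance stands in for dist, and the matching between reference points and centroids is found by exhaustive search over permutations, which is only practical for a small number of clusters k.

```python
from itertools import permutations
from math import dist


def update_centroids(points, assignment, old_centroids):
    """Step 2: each centroid becomes the mean of its assigned points.

    An empty cluster keeps its previous centroid.
    """
    new_centroids = []
    for h, old in enumerate(old_centroids):
        members = [p for p, a in zip(points, assignment) if a == h]
        if members:
            dim = len(old)
            mean = tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
            new_centroids.append(mean)
        else:
            new_centroids.append(old)
    return new_centroids


def permute_constraints(reference_points, constraints, centroids):
    """Step 3: match reference points to centroids with minimal total distance.

    constraints[i] is anchored at reference_points[i]. The returned list holds,
    at position h, the constraint that now applies to cluster h.
    Assumption: brute-force search over all k! matchings (small k only).
    """
    k = len(centroids)
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(dist(reference_points[i], centroids[perm[i]]) for i in range(k)),
    )
    permuted = [None] * k
    for i, h in enumerate(best):
        permuted[h] = constraints[i]
    return permuted
```

If the centroids of two clusters have swapped places relative to their reference points, permute_constraints swaps the two size constraints as well, so each constraint stays attached to the region of the data space the user selected.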

3 Methodology for Expert-Guided Constrained Clustering Facilitated by Data Visualization

This section presents the proposed semi-supervised constrained clustering
methodology. In addition to the improved constrained clustering algorithm
(Algorithm 1), the main assets used in this process are data visualization
and user-guided constrained clustering through an interface to the visualized
data clouds, enabling initial centroid selection and size constraint specification.
This section first outlines the steps of the proposed methodology and then
presents the algorithm which enables the visualization of the data space, more
specifically, a document space using the bag-of-words document representation.
In the context of this paper, the process of visualization is seen as a procedure
which extracts knowledge about the underlying structure of the data. The
visualization method should provide enough information to guide
the expert when specifying the constraints and should also (implicitly) help the
clustering algorithm to converge faster by providing good initial centroids. The
document corpora visualization method presented in this section is a
combination of multidimensional scaling, a least-squares solver, and internal k-means
clustering.

3.1  Methodology 

The  proposed  semi-supervised  constrained  clustering  methodology  consists  of 
the  following  main  steps: 

1. The input data is preprocessed as required by the clustering and visualization
algorithms.

2. The least-squares meshes data visualization algorithm is invoked. As a result,
the user is presented with a 2D projection of high-dimensional data instances,
such as the one presented in Figure 1a.

3. The graphical user interface of our algorithms enables the user to visually
identify centers of condensed groups of data instances and to anchor
constraints to these points, called reference points (visualized as triangles in
Figure 1b). Furthermore, the user defines each of the constraints by setting
the lower- and/or upper-size limit of the corresponding cluster.

4. When the constraints are fully specified, the constrained clustering algorithm
is invoked. The algorithm takes reference points (i.e. constraint anchors) as
the initial centroid locations. The centroids then travel around the space
during the optimization process, while the reference points (and thus the
constraints) keep their initial positions. In each step of the clustering process,
the constraints can be reassigned to centroids (if necessary) according to the
constraint permutation step in Algorithm 1.

5. The algorithm outputs the size-constrained data clusters to be further
inspected (and possibly refined) by the user.

Fig. 1. 2D projection of high-dimensional data instances of the Yahoo Finance dataset
(a) with manually selected reference points (b)

3.2  Visualization  of  Large  Document  Corpora 

A bag-of-words document space is a high-dimensional space in which documents
are represented as feature vectors (TF-IDF vectors). To visualize the
bag-of-words space, we need to project feature vectors onto a 2-dimensional canvas
so that the distances between the planar points reflect the cosine similarities
between the corresponding feature vectors.
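
As a small illustration of the similarity that the projection is meant to preserve, the following Python sketch computes the cosine similarity of two sparse TF-IDF vectors. It is an assumption-laden example rather than code from the described system: the vectors are stored as hypothetical {term: weight} dictionaries, and 1 minus the similarity is used as one possible target distance between the planar points.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # An empty document shares no terms with anything.
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed target distance for the layout: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)
```

Two documents with proportional term weights have similarity 1 (distance 0), while documents with no terms in common have similarity 0 (distance 1).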

For the purpose of this visualization, we followed the work of Sorkine and
Cohen-Or [16] and Paulovich et al. [14], which is based on least-squares meshes
(for this reason, we use the term least-squares meshes visualization throughout
this paper). To compute the projection of high-dimensional feature vectors onto
a planar canvas, several methods are employed in a pipeline. Clustering of the
feature vectors is first performed to obtain several smaller, more manageable
segments of the feature space. Then, several representative instances (medoids)
of the obtained clusters are selected and their layout is computed. As the
number of representative instances r is much smaller than the number of feature
vectors n ($r \ll n$), computationally expensive techniques can be employed for
this purpose. In our case, stress majorization [10] is employed to perform this
step of the process. After the representative instances are positioned in 2D, a
system of linear equations is constructed and solved in the least-squares sense.
The solution of the system represents the projection of all the feature vectors onto
a planar canvas. To construct the system of linear equations, planar coordinates
of several control points (obtained by the stress majorization algorithm) and
the k nearest neighbors of each instance are required. In Eq. 4, $p_i$ denotes a
point (both coordinates) and $N_{p_i}$ the set of its nearest neighbors (note that a
point is not its own nearest neighbor).


$$p_i - \frac{1}{|N_{p_i}|} \sum_{p_j \in N_{p_i}} p_j = 0 \qquad (4)$$

Instances of Eq. 4 for all points and precomputed positions of control points can
be expressed as a system of sparse linear equations, as shown² in Eq. 5.


$$\begin{bmatrix} A \\ A' \end{bmatrix} X = \begin{bmatrix} 0 \\ B \end{bmatrix} \qquad (5)$$


Here, (sub)system AX = 0 contains instances of Eq. 4 and (sub)system A'X = B
defines the (known) positions of control points. Vector X represents the unknown
final positions of points. Note that the system defined by Eq. 5 is
overdetermined³. As such systems usually have no exact solution, the goal is to find a
“solution” which fits the equations best in the least-squares sense. In our
document stream visualization framework the LSQR solver, developed by Paige and
Saunders [13], was used to obtain the solution of Eq. 5, which is a set of planar
points corresponding to the high-dimensional feature vectors.
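
To make the structure of the system in Eq. 5 concrete, the following Python sketch assembles it for a single coordinate axis and solves it in the least-squares sense. It is a toy illustration only: the function names are hypothetical, the neighbourhood rows assume that each point should coincide with the mean of its nearest neighbours, and the dense normal equations solved by Gaussian elimination are a stand-in for the sparse LSQR solver used in our framework.

```python
def solve_linear(M, b):
    """Solve the square system M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def least_squares_layout(neighbors, controls):
    """Place n points along one axis.

    neighbors[i] lists the nearest neighbours of point i (rows of A X = 0);
    controls maps a point index to its known coordinate (rows of A' X = B).
    """
    n = len(neighbors)
    rows, rhs = [], []
    for i, nbrs in enumerate(neighbors):
        row = [0.0] * n
        row[i] = 1.0
        for j in nbrs:
            row[j] -= 1.0 / len(nbrs)
        rows.append(row)
        rhs.append(0.0)
    for i, value in controls.items():
        row = [0.0] * n
        row[i] = 1.0
        rows.append(row)
        rhs.append(value)
    # Normal equations (M^T M) x = M^T b yield the least-squares solution.
    mtm = [[sum(r[a] * r[c] for r in rows) for c in range(n)] for a in range(n)]
    mtb = [sum(r[a] * y for r, y in zip(rows, rhs)) for a in range(n)]
    return solve_linear(mtm, mtb)
```

For a 2D layout the function would be called once per coordinate, with the same neighbourhood structure and the corresponding coordinate of each control point.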


3.3  Implementation 

The proposed data analysis methodology was implemented using a combination
of various technologies and open source software libraries. Firstly, the data
preprocessing step was implemented in Python. Secondly, the Orange4WS web
service environment [15] was employed to invoke web services built on top of
the LATINO multilingual text mining library⁴, which provides all the required
components to produce sparse vector representations of textual data and their
visualization: tokenizers, lemmatizers/stemmers, n-gram detection, bag-of-words
computation, and the least-squares meshes visualization method. The clustering
algorithm was implemented in Python using the numpy package⁵ for numerical
computations and the Python interface to the lp_solve mixed integer linear
programming (MILP) solver⁶, which essentially forms the backbone of the
constrained k-means clustering algorithm. The graphical user interface was written
in Python using the cross-platform open source framework Qt⁷ and its extension for
technical applications named Qwt⁸.


4  Evaluation 

The proposed methodology is illustrated through constrained clustering tasks
using four datasets: the Yahoo Finance dataset⁹ with 6177 short company
descriptions, the Inductive Logic Programming (ILP) dataset¹⁰ with 1407 scientific

2 For the sake of clarity, this system combines all dimensions of points (vectors X and
B have dimensions [n × 2]). In practice, we have to solve such a system for each
dimension separately.

3 Overdetermined systems of linear equations have more equations than variables.

4 http://sourceforge.net/projects/latino

5 http://numpy.scipy.org

6 http://lpsolve.sourceforge.net/5.5/

7 http://qt.nokia.com/

8 http://qwt.sourceforge.net/

9 Available at http://ontogen.ijs.si/?page_id=10

10 Available at http://www.cs.bris.ac.uk/~ILPnet2/
publications out of which 506 contain both titles and abstracts and the rest
contain titles only, and a corpus containing the Proceedings of the Slovenian
Informatics Conference¹¹ (DSI) from 2003 to 2009 with 833 texts in the Slovene
language (the use of this dataset also demonstrates the multilingual abilities of our
implementation). Figure 2 shows the visualization of all four datasets using the
least-squares meshes method. The least-squares meshes visualization provides
enough information to identify potential clusters, their approximate sizes
and initial centroids. This visual information was used to set up constraints for
the constrained clustering scenario. For example, in the Yahoo Finance dataset,
nine clusters were identified by the user and the corresponding size constraints
were defined through visual assessment by using the graphical user interface.
Figure 1b presents the positions of our reference points carrying cluster size
constraints.


Fig. 2. Visualization of datasets: (a) Yahoo Finance (6177 instances), (b) subset of
ILP (Inductive Logic Programming) with abstracts available (506 instances), (c) subset
of ILP with only titles available (1401 instances), (d) DSI (Slovenian Informatics
Conference) (833 instances)


Table 1 summarizes the experimental results. Our semi-supervised algorithm
was compared to ordinary k-means using the same initial centroids, and to
ordinary k-means using random initial centroids¹². The number of clusters was
determined visually (least-squares meshes visualization) by identifying dense
components and well-separated parts of the data space. We used the Davies-Bouldin
cluster validity measure [5], which is a function of the ratio of the sum of
within-cluster scatter to between-cluster separation. The measure is defined as:


$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \qquad (6)$$


where n is the number of clusters, $\sigma_i$ is the average distance of all data instances
in cluster i to their cluster center $c_i$, $\sigma_j$ is the average distance of all data
instances in cluster j to their cluster center $c_j$, and $d(c_i, c_j)$ is the distance of

11  Available  at  http://iil2.ijs.si /hCorpus.ht  mI#DSl 

12 Note, however, that more elaborate clustering techniques were not used as our goal
was not to obtain the best possible clustering of data but to demonstrate and assess
the semi-supervised clustering methodology and to evaluate the modified constrained
k-means clustering algorithm.
cluster centers $c_i$ and $c_j$. Small values of the DB measure correspond to clusters
that are compact and whose centers are far away from each other.
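
The definition in Eq. 6 translates directly into code. The sketch below is a hedged illustration, not the evaluation script used for Table 1: the function name is hypothetical, Euclidean distance is assumed for both the scatter and the separation (the experiments in this paper use document vectors, for which another distance may be more appropriate), and every cluster is assumed to be non-empty with pairwise distinct centers.

```python
from math import dist


def davies_bouldin(clusters, centers):
    """Davies-Bouldin index for a list of clusters and their centers.

    clusters[i] is a list of points, centers[i] the center of cluster i.
    Assumes at least two non-empty clusters with distinct centers.
    """
    n = len(clusters)
    # sigma[i]: average distance of the members of cluster i to its center.
    sigma = [sum(dist(x, c) for x in pts) / len(pts) for pts, c in zip(clusters, centers)]
    total = 0.0
    for i in range(n):
        total += max(
            (sigma[i] + sigma[j]) / dist(centers[i], centers[j])
            for j in range(n)
            if j != i
        )
    return total / n
```

For two clusters whose members lie at distance 1 from their centers and whose centers are 10 apart, each term of the sum is (1 + 1) / 10, so the index is 0.2.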

Results in Table 1 show that the proposed methodology is effective. Our
semi-supervised k-means algorithm outperforms the other two variants of the k-means
algorithm both in terms of the quality of the obtained clustering and the convergence
rate. Moreover, our methodology also provides control of the clustering process by
incorporating knowledge about the input data space at no additional cost. With the
exception of the degenerate¹³ ILP-titles dataset, where it needed more steps to
converge (but gave a better clustering), our variant of semi-supervised constrained
clustering achieved the best scores. Table 1 also demonstrates the effectiveness of
the visualization method used, as the k-means algorithm (with the same initial
centroids as used in semi-supervised k-means) outperformed randomized k-means
because the visualization enabled us to select good initial centroids, better
than those selected at random in a set of 10 trials.


Table 1. Empirical evaluation of the proposed methodology and comparison of
algorithms on four document corpora. Small values of the DB measure indicate good
clustering. Note that the values for k-means with random initial centroids are averaged
over 10 repetitions.

                 k-means               rand. k-means         semi-sup. k-means
                 DB measure  # iters   DB measure  # iters   DB measure  # iters
Yahoo Finance    10.2        21        10.11       23.3      9.03        9
DSI              8.33        7         8.71        9.0       7.84        7
ILP              7.15        16        7.5         10.3      6.72        10
ILP-titles       8.12        6         8.47        7.0       7.57        14


While the evaluation was carried out on only four document corpora, some
general conclusions can be drawn. Firstly, the least-squares meshes visualization
method is able to provide enough background knowledge about the input
corpora, which can be used to supervise the clustering process. However, as the
proposed approach is not limited to textual data, other techniques for
dimensionality reduction need to be employed for other types of data. To this end, our
implementation contains multi-dimensional scaling (MDS) and its faster
simplification FastMap [7]. Secondly, using our modification of the constrained k-means
algorithm it is possible to pose specific constraints on specific clusters, and the
results show that, backed up by the visualization, such a setup provides a powerful
and efficient means of semi-supervised data analysis. Finally, the proposed
solution can be used on any kind of data by potentially modifying the similarity
measure¹⁴ in the visualization algorithm.

13  The  average  number  of  features  of  vectors  in  this  dataset  is  only  14  which  is  extremely 
low  for  a  textual  dataset. 

14  The  cosine  similarity  measure  was  used  to  compute  similarities  between  TF-IDF 
feature  vectors  obtained  from  documents. 

5 Conclusions

In this paper, we presented a novel methodology for expert-guided
semi-supervised data analysis which counters the identified problems. First, it
allows the user to interrelate the user's knowledge of the data with the specified
size constraints. This is achieved by anchoring the constraints to the specified
reference points (which remain fixed in the data space) rather than to the
centroids (which move). Second, for the user to be able to specify the anchor points,
the data is visualized by projecting the high-dimensional feature vectors onto a
planar canvas. By inspecting the visualization, the user is able to identify
condensed groups of data instances and place reference points into centers of such
groups of instances.

The advantage of the proposed approach is that the clusters resulting from
such a semi-supervised clustering process tend to be of higher quality than clusters
obtained by using the ordinary (constrained) k-means algorithm. Furthermore,
the clustering optimization process tends to converge faster. Both these effects
result from the fact that the user-defined reference points are a much better set
of initial centroids than randomly selected ones. As illustrated on the Yahoo
Finance dataset, the least-squares meshes visualization method provides enough
background knowledge about the input data for the user to effectively supervise
the clustering process. Next, using our modification of the constrained k-means
algorithm, it is possible to pose specific constraints on data with certain specifics
(e.g. documents talking about the same or a similar topic).

In conclusion, the proposed methodology presents a powerful and effective way
to supervise the analysis of data. In further work, we plan to conduct experiments
also on non-textual data using appropriate visualization techniques such as PCA,
MDS and SOM. We expect that the proposed methodology will demonstrate its
abilities even more clearly, as the number of dimensions in non-textual data
is typically orders of magnitude lower and good low-dimensional projections are
much easier to obtain.

We will also improve the user interface by integrating a keyword
extractor which will provide summarized information about data instances. This will
greatly improve the understanding of the observed corpora and help to identify
and inspect potential groups (clusters, topics) and their properties. We also plan
to release the software under an open source license. Moreover, we will offer the
individual components of the presented methodology as a set of services, freely
available on the Web, and ready to be used in any service-oriented data mining
environment.

Abstract. We consider the problem of computing optimal plans for
propositional planning problems with action costs. In the spirit of leveraging advances
in general-purpose automated reasoning for that setting, we develop an approach
that operates by solving a sequence of partial weighted MaxSAT problems, each
of which corresponds to a step-bounded variant of the problem at hand. Our
approach is the first SAT-based system in which a proof of cost-optimality is
obtained using a MaxSAT procedure. It is also the first system of this kind to
incorporate an admissible planning heuristic. We perform a detailed empirical
evaluation of our work using benchmarks from a number of International
Planning Competitions.


1  Introduction 

Recently there have been significant advances in the direction of optimal planning
procedures that operate by making multiple queries to a decision procedure, usually a
Boolean SAT procedure. For example, the work of Hoffman et al. [1] answers a key
challenge from Kautz [2] by demonstrating how existing SAT-based planning
techniques can be made effective solution procedures for fixed-horizon planning with
metric resource constraints. In the same vein, Russell & Holden [3] and Giunchiglia &
Maratea [4] develop optimal SAT-based procedures for net-benefit planning in
fixed-horizon problems. In that case actions can have costs and goal utilities can be
interdependent. Moreover, in the direction of improving the scalability and efficiency of
SAT-based approaches in step-optimal (and indeed fixed-horizon) planning, Robinson
et al. [5] present an encoding of step-bounded planning problems that shows
significant performance gains over previous results. Large performance gains have also been
demonstrated where efficient and sophisticated query strategies are employed [6,7].
Summarising, in the settings of step-optimal and fixed-horizon planning, recent works
have demonstrated that SAT-based techniques inspired by systems like BLACKBOX [8]
continue to dominate other approaches.

Considering the planning literature more generally, numerous distinct criteria for
plan optimality have been proposed. These include: (1) Minimise makespan (a.k.a.
step-optimality): the objective is to find a plan of minimal length. (2) Minimise plan
cost: each action has a numeric cost, a plan's cost is the sum of the costs of its
constituent actions, and an optimal plan has minimal cost. (3) Maximise net-benefit:
states (resp. actions) have rewards (resp. costs), and an optimal plan is a sequence of
actions executable from the starting state that induces a behaviour of maximal utility.
These problems are sometimes called oversubscribed, and were recently shown to be
equivalent (using a compilation) to the cost-optimising setting [9]. One key observation
to be made is that the above optimality criteria are often conflicting. For example, a
plan with minimal makespan is not guaranteed to be cost- or utility-optimal. Indeed, in
the general case there is no link between the number of plan steps (planning horizon)
and plan quality.

Existing SAT-based planning procedures are limited to makespan-optimal and
fixed-horizon settings, i.e., either the objective is to minimise the number of plan steps,
or valid optimal solutions are constrained to be of, or less than, a fixed length. Thus,
the use of SAT-based techniques is limited in practice. For example, optimal
SAT-based planning procedures were unable to participate effectively at the International
Planning Competition (IPC) in 2008 due to the adoption of a single optimisation
criterion (cost-optimality). This paper overcomes that restriction, developing COS-P, the
first sound and complete cost-optimal planning procedure based solely on a Boolean
SAT(isfiability) procedure. Thus, we open the door to leveraging SAT technology in
planning settings with arbitrary optimisation criteria.

The remainder of this paper is organised as follows. We first give an overview of
optimal propositional planning with action costs, delete relaxations of that problem,
and the partial weighted MaxSAT optimisation problem. We then describe our approach
in detail, developing compilations to partial weighted MaxSAT of the fixed-horizon
planning problem, and of the fixed-horizon problem with a relaxed suffix. Following this
we develop our novel MaxSAT solution procedure PWM-RSat. We then empirically
evaluate our approach on planning benchmarks from a number of IPCs. Finally we
discuss some related work and propose some interesting directions for future research.


2  Background  and  Notations 

2.1  Propositional  Planning  with  Action  Costs 

A propositional planning problem with costs is a 5-tuple $\Pi = \langle P, A, s_0, \mathcal{G}, \mathcal{C} \rangle$. Here,
P is a set of propositions that characterise problem states; A is the set of actions that
can induce state transitions; $s_0 \subseteq P$ is the starting state; and $\mathcal{G} \subseteq P$ is the set of
propositions that characterise the goal. The function $\mathcal{C} : A \to \mathbb{R}^{+}_{0}$ is a bounded cost
function that assigns a non-negative cost-value to each action. This value corresponds
to the cost of executing the action.

Each action $a \in A$ is described in terms of its preconditions $\mathrm{pre}(a) \subseteq P$, positive
effects $\mathrm{eff}^{+}(a) \subseteq P$, and negative effects $\mathrm{eff}^{-}(a) \subseteq P$. An action a can be executed at a
state $s \subseteq P$ when $\mathrm{pre}(a) \subseteq s$. We write $\mathcal{A}(s)$ for the set of actions that can be executed
at state s. Formally, $\mathcal{A}(s) = \{a \mid a \in A,\ \mathrm{pre}(a) \subseteq s\}$. When $a \in \mathcal{A}(s)$ is executed at
s the successor state is $(s \cup \mathrm{eff}^{+}(a)) \setminus \mathrm{eff}^{-}(a)$. Actions cannot both add and delete the
same proposition, i.e., $\mathrm{eff}^{+}(a) \cap \mathrm{eff}^{-}(a) = \emptyset$.¹ A state s is a goal state iff $\mathcal{G} \subseteq s$.
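
The state-transition semantics above can be sketched in a few lines of Python. This is an illustrative model only, with hypothetical names (Action, applicable, successor, is_goal); states and effect sets are represented as frozensets of proposition names, and nothing here is taken from an existing planner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset      # preconditions pre(a)
    add: frozenset      # positive effects
    delete: frozenset   # negative effects
    cost: float = 0.0   # non-negative action cost


def applicable(actions, state):
    """The set A(s): actions whose preconditions hold in state s."""
    return [a for a in actions if a.pre <= state]


def successor(state, action):
    """State reached by executing an applicable action: (s ∪ add) \\ delete."""
    if not action.pre <= state:
        raise ValueError(f"{action.name} is not applicable in this state")
    return (state | action.add) - action.delete


def is_goal(state, goal):
    """A state is a goal state iff it contains every goal proposition."""
    return goal <= state
```

For example, an action move with precondition {at_a}, positive effect {at_b} and negative effect {at_a} maps the state {at_a} to the state {at_b}.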

Usually any two actions $a_1, a_2 \in A$ are permitted to be executed instantaneously
in parallel at a state provided any serial execution of the actions is valid and achieves

1 In practice this case is given a special semantics, the details of which shall not be considered
further here.
an identical outcome. When two actions cannot be executed in parallel we say they
conflict. Supposing non-conflicting actions can be executed instantaneously in parallel,
a plan $\pi$ is a discrete sequence of time-indexed sets of non-conflicting actions which,
when applied to the start state, lead to a goal state. We say a plan is serial (a.k.a. linear
plan), denoted $\pi$, if each time-indexed set contains one action. Finally, where $A_i$ is the
set of actions at step i of $\pi = [A_1, A_2, \ldots, A_h]$, the cost of $\pi$, written $\mathcal{C}(\pi)$, is:

$$\mathcal{C}(\pi) = \sum_{i=1}^{h} \sum_{a \in A_i} \mathcal{C}(a)$$

A number of different conditions for plan optimality can be defined. In particular, a
plan is parallel step-optimal if no shorter plan of the same parallel format exists. The
definition for serial step-optimality is identical, but also respects the condition that a
valid plan has only one action executed at each step. A plan $\pi^{*}$ is cost-optimal if there
is no plan $\pi$ s.t. $C(\pi) < C(\pi^{*})$. Finally, we draw the reader's attention to the fact that
the definition of cost-optimality is not dependent on the plan format.

2.2  The  Relaxed  Planning  Problem 

A delete relaxation $\Pi^{+}$ of a planning problem $\Pi$ is an equivalent problem in all respects
except the definition of actions. In particular, the set of actions $A^{+}$ in $\Pi^{+}$ comprises
the elements $a \in A$ from $\Pi$ altered so that $\mathrm{eff}^{-}(a) = \emptyset$. The relaxed
problem has two key properties of interest here. First, the cost of an optimal plan from any
reachable state in $\Pi$ is greater than or equal to the cost of the optimal plan from that state
in $\Pi^{+}$. Consequently relaxed planning can yield a useful admissible heuristic in search. For
example, a best-first search such as A* can be heuristically directed towards an optimal
solution by using the costs of relaxed plans to arrange the priority queue. Second, although
NP-hard to solve optimally in general [10], in practice optimal solutions to the
relaxed problem $\Pi^{+}$ are more easily computed than for $\Pi$.
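
As a rough illustration of the delete relaxation, the following Python sketch strips the negative effects from an action and computes the set of propositions reachable in the relaxed problem. The `Action` tuple, the function names, and the two-location example are assumptions made for this sketch only.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    """Illustrative action record; not the planner's own data structure."""
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float


def relax(action):
    """Delete relaxation of one action: identical, but with no negative effects."""
    return action._replace(delete=frozenset())


def apply(state, action):
    """Successor state; for a relaxed action the state can only grow."""
    return (state | action.add) - action.delete


def relaxed_reachable(state, relaxed_actions):
    """Fixpoint of all propositions reachable from state in the relaxed problem."""
    reached = set(state)
    changed = True
    while changed:
        changed = False
        for a in relaxed_actions:
            if a.pre <= reached and not a.add <= reached:
                reached |= a.add
                changed = True
    return frozenset(reached)


# Hypothetical example: two moves along a line a -> b -> c.
ab = Action("move-a-b", frozenset({"at-a"}), frozenset({"at-b"}), frozenset({"at-a"}), 1.0)
bc = Action("move-b-c", frozenset({"at-b"}), frozenset({"at-c"}), frozenset({"at-b"}), 1.0)
start = frozenset({"at-a"})
reached = relaxed_reachable(start, [relax(ab), relax(bc)])
```

In the relaxed problem the truck is "at" every visited location at once, which is why relaxed plans can only be cheaper than, or as cheap as, real plans.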

2.3  Partial  Weighted  MaxSAT 

A Boolean SAT problem is a decision problem, instances of which are typically expressed
as a CNF propositional formula. A CNF corresponds to a conjunction over
clauses, each of which corresponds to a disjunction over literals. A literal is either a
proposition (i.e., Boolean variable symbol) or its negation. Where $\models$ denotes semantic
entailment for propositional logic, a solution associated with a formula $\phi$ is an assignment
(a.k.a. valuation) $V$ of truth values to propositions with the property $V \models \phi$.

A Boolean MaxSAT problem is an optimisation problem related to SAT. In practice
a problem instance is again typically expressed as a CNF; however, the objective now is
to compute a valuation that maximises the number of satisfied clauses. In detail, writing
$\kappa \in \phi$ if $\kappa$ is a clause in formula $\phi$, and taking $V \models \kappa$ to have
numeric value 1 when valid, and 0 otherwise, a solution $V^{*}$ to a MaxSAT problem has the property:

$$V^{*} = \arg\max_{V} \sum_{\kappa \in \phi} (V \models \kappa)$$

A weighted MaxSAT problem [11], denoted $\psi$, is a MaxSAT problem where each
clause $\kappa \in \psi$ has a bounded positive numerical weight $\omega(\kappa)$. The optimal
solution $V^{*}$ to some $\psi$ satisfies the following equation:

$$V^{*} = \arg\max_{V} \sum_{\kappa \in \psi} \omega(\kappa)\,(V \models \kappa)$$

Finally, the partial weighted MaxSAT problem [12] is a variant of weighted MaxSAT
that distinguishes between hard and soft clauses. Only soft clauses are given a weight.
In these problems a solution is valid iff it satisfies all hard clauses. Therefore we have a
notion of satisfiability. In particular, if the hard problem fragment of a partial weighted
MaxSAT formula is unsatisfiable, then we say the formula is unsatisfiable. The definition
of satisfiable follows naturally. An optimal solution to a partial weighted MaxSAT
problem is an assignment $V^{*}$ that is both valid and satisfies the above equation.
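
To make the definition concrete, here is a deliberately naive Python sketch that solves a tiny partial weighted MaxSAT instance by enumerating every valuation. It is not how a practical solver works; the function names, the integer literal convention, and the example instance are assumptions of this sketch.

```python
from itertools import product


def satisfies(assignment, clause):
    """A clause is a list of non-zero ints: v means variable v, -v its negation."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def solve_pwmaxsat(num_vars, hard, soft):
    """Enumerate all valuations of variables 1..num_vars.

    hard is a list of clauses, soft a list of (clause, weight) pairs.
    Returns (assignment, satisfied_soft_weight), or None when the hard
    clauses are unsatisfiable.
    """
    best = None
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if not all(satisfies(assignment, c) for c in hard):
            continue  # invalid: some hard clause is violated
        weight = sum(w for c, w in soft if satisfies(assignment, c))
        if best is None or weight > best[1]:
            best = (assignment, weight)
    return best


# Hypothetical instance: hard clause (x1 or x2); soft clauses (not x1) with
# weight 3 and (not x2) with weight 1.
result = solve_pwmaxsat(2, hard=[[1, 2]], soft=[([-1], 3), ([-2], 1)])
```

The hard clause forces at least one variable to be true, so the best valuation gives up the cheaper soft clause and keeps the one with weight 3.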


3  COS-P 

We now describe COS-P, our planner that operates by iteratively solving variants of
$n$-step-bounded instances of the problem at hand for successively larger $n$. Solutions
to the intermediate step-bounded instances are obtained by compiling them into equivalent
partial weighted MaxSAT problems, and then using our own MaxSAT procedure
PWM-RSat to compute their optimal solutions.

COS-P compiles and solves two variants, Variant-I and Variant-II, of the intermediate
instances. Those are characterised in terms of their optimal solutions. Adopting
the notation $\Pi_n$ for the $n$-step-bounded variant of $\Pi$, Variant-I admits optimal
solutions that correspond to minimal cost plans in the parallel format for $\Pi_n$. Variant-II
admits optimal plans with the following structure. Each has a prefix which corresponds
to $n$ sets of actions from $\Pi_n$.² Plans can have an arbitrary length suffix (including length
0) comprised of actions from the delete relaxation $\Pi^{+}$.

Both variants can be categorised as direct, constructive, and tightly sound. They are
direct because we have a Boolean variable in the MaxSAT problem for every action
and state proposition at each plan step. They are constructive because any satisfying
model and its cost in the MaxSAT instances corresponds to a plan and its cost in the
source problem. Critically, our compilations are tightly sound, in the sense that every
plan with cost $c$ in the source planning problem has a corresponding satisfying model
of cost $c$ in the MaxSAT encoding and vice versa. This permits two key observations
about Variant-I and Variant-II. First, when both variants yield an optimal solution,
and both those solutions have identical cost, then the solution to Variant-I is a cost-optimal
plan for $\Pi$. Second, if $\Pi$ is soluble, then there exists some $n$ for which the
observation of global optimality shall be made by COS-P. Finally, we have that COS-P
is a sound and complete optimal planning procedure for propositional problems with
action costs.
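
The control loop implied by these two observations can be sketched as follows. This is a minimal illustration, assuming two hypothetical callables that stand in for "compile the variant at horizon n and solve it with a MaxSAT procedure"; the stub data below is invented purely to exercise the loop and does not come from any benchmark.

```python
def cos_p(solve_variant_1, solve_variant_2, max_horizon):
    """Iterate over horizons until the Variant-I optimum is proved globally optimal.

    Each solver argument takes a horizon n and returns (plan, cost), or None
    if the corresponding instance is unsatisfiable. Both are assumptions of
    this sketch, not the planner's real interface.
    """
    for n in range(1, max_horizon + 1):
        candidate = solve_variant_1(n)   # cheapest plan with exactly n steps
        if candidate is None:
            continue                     # no n-step plan exists yet
        relaxed = solve_variant_2(n)     # n-step prefix plus relaxed suffix
        if relaxed is not None and relaxed[1] == candidate[1]:
            return n, candidate          # identical costs: candidate is optimal
    return None                          # horizon limit reached without a proof


# Hypothetical stub solvers, keyed by horizon.
variant_1 = {1: None, 2: (["a", "b"], 10), 3: (["a", "c", "d"], 7)}
variant_2 = {1: (["a"], 5), 2: (["a", "b"], 7), 3: (["a", "c", "d"], 7)}
result = cos_p(variant_1.get, variant_2.get, max_horizon=5)
```

With this stub data the loop rejects horizon 2, because a relaxed suffix still promises cost 7 against the candidate's 10, and accepts the candidate found at horizon 3.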

For the remainder of this section we present the compilation for Variant-I and
Variant-II. In the following section we describe the MaxSAT procedure PWM-RSat
that we developed for use by COS-P.


² i.e., an $n$-step plan prefix in the parallel format.

3.1 Variant-I: Bounded Cost-Optimal Planning

We now describe a direct compilation of the bounded propositional planning problem
with action costs to a partial weighted MaxSAT formula $\psi$. The source of our compilation
is the plangraph. This is an obvious choice because reachability and neededness
analysis performed during construction of the plangraph yields important mutex
constraints between action and propositional variables [13]. Such constraints are not
deduced independently by modern SAT procedures such as RSat2.02 [14].

Below, we develop our compilation in terms of a list of 6 schemata. The first 5
schemata capture the hard logical planning constraints, and Schema 6 reflects the action
costs. Overall, the schemata we develop below make use of the following propositional
variables. For each action occurring at a step $t = 0, \ldots, n-1$ (excluding noop actions),
we have a variable $a^{t}$. We define a fluent to be a state proposition whose truth value
can be modified by action executions. For each fluent occurring at step $t = 0, \ldots, n$ we
have a variable $p^{t}$. Also, we have $\mathrm{make}(p) = \{a \mid a \in A, p \in \mathrm{eff}^{+}(a)\}$,
and $\mathrm{break}(p) = \{a \mid a \in A, p \in \mathrm{eff}^{-}(a)\}$. Below we avoid annotating
variables with their time index if it is clear from the context. Lastly, all constraints are hard
unless stated otherwise.

1. Goal and start state axioms: We have a unit clause containing $p^{0}$ for every $p \in s_0$
and $p^{n}$ for every $p \in \mathcal{G}$.

2. Precondition and effect axioms: For every action $a$ at each plan step $t$, we have
clauses that require: (i) the action implies its preconditions, (ii) the action implies its
positive effects, and (iii) the action implies its negative effects:

$$[a^{t} \to \bigwedge_{p \in \mathrm{pre}(a)} p^{t}] \wedge [a^{t} \to \bigwedge_{p \in \mathrm{eff}^{+}(a)} p^{t+1}] \wedge [a^{t} \to \bigwedge_{p \in \mathrm{eff}^{-}(a)} \neg p^{t+1}]$$

3. Mutex axioms: For every pair of mutex symbols (actions or fluents) $p_1$ and $p_2$ at
step $t$, we have a clause: $\neg p_1^{t} \vee \neg p_2^{t}$

4. At least one action axioms: Where $A^{t}$ is the set of actions at step $t$, we have a
clause that requires at least one action be executed at step $t$: $\bigvee_{a^{t} \in A^{t}} a^{t}$

5. Frame axioms: These constrain how the truth values of fluents change over successive
plan steps. For each proposition $p^{t}$, $t > 0$ we include the following clauses:

$$[p^{t} \to (p^{t-1} \vee \bigvee_{a \in \mathrm{make}(p)} a^{t-1})] \wedge [\neg p^{t} \to (\neg p^{t-1} \vee \bigvee_{a \in \mathrm{break}(p)} a^{t-1})]$$

6. Action cost axioms (soft): Finally, we have a set of soft constraints for actions.
In particular, for each action variable $a^{t}$ such that $C(a) > 0$, we have a unit clause
$\kappa_i := \{\neg a^{t}\}$ with weight $\omega(\kappa_i) = C(a)$.
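
The schemata can be turned into clause generators quite mechanically. The Python sketch below is a simplified illustration, not the COS-P compiler: it works on a flat problem description rather than a plangraph, so it omits the mutex axioms of Schema 3, and it adds negative unit clauses for propositions outside the start state (an assumption standing in for the plangraph's reachability analysis). All names and the literal representation are hypothetical.

```python
def encode_variant_1(props, actions, s0, goal, n):
    """Return (hard, soft) clauses for an n-step flat encoding.

    actions maps a name to (pre, add, delete, cost). A literal is a pair
    ((kind, symbol, step), polarity); a clause is a list of literals.
    Schema 3 (mutex) is omitted because no plangraph is built here.
    """
    def pos(kind, x, t):
        return ((kind, x, t), True)

    def neg(kind, x, t):
        return ((kind, x, t), False)

    hard, soft = [], []
    # Schema 1: start state (closed world, an assumption of this sketch) and goal.
    for p in props:
        hard.append([pos("p", p, 0) if p in s0 else neg("p", p, 0)])
    for p in goal:
        hard.append([pos("p", p, n)])
    for t in range(n):
        # Schema 4: at least one action at every step.
        hard.append([pos("a", a, t) for a in actions])
        for a, (pre, add, delete, cost) in actions.items():
            # Schema 2: an action implies its preconditions and effects.
            hard += [[neg("a", a, t), pos("p", p, t)] for p in pre]
            hard += [[neg("a", a, t), pos("p", p, t + 1)] for p in add]
            hard += [[neg("a", a, t), neg("p", p, t + 1)] for p in delete]
            # Schema 6: soft unit clause penalising execution of a costly action.
            if cost > 0:
                soft.append(([neg("a", a, t)], cost))
        # Schema 5: frame axioms between steps t and t + 1.
        for p in props:
            makers = [pos("a", a, t) for a, v in actions.items() if p in v[1]]
            breakers = [pos("a", a, t) for a, v in actions.items() if p in v[2]]
            hard.append([neg("p", p, t + 1), pos("p", p, t)] + makers)
            hard.append([pos("p", p, t + 1), neg("p", p, t)] + breakers)
    return hard, soft


# Hypothetical one-action problem, compiled for a single step.
props = ["at-a", "at-b"]
actions = {"move": ({"at-a"}, {"at-b"}, {"at-a"}, 2)}
hard, soft = encode_variant_1(props, actions, {"at-a"}, {"at-b"}, n=1)
```

Keeping the soft clauses as negated action literals means that a MaxSAT solver maximising satisfied weight is, equivalently, minimising the total cost of the executed actions.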


3.2 Variant-II: $n$-Step with a Relaxed Suffix

We now describe a direct compilation of the problem $\Pi_n$ from the previous section,
along with the addition of a causal encoding of the delete relaxation, that we make
available from step $n$.³ From hereon we refer to the latter as the relaxed suffix.

Our encoding of the relaxed suffix is causal in the sense developed in [15] for their
ground parallel causal encoding of propositional planning in SAT. This requires additional
variables to those developed for Variant-I. In particular, for each fluent $p$ and
relaxed action $a \in A^{+}$ we have corresponding variables $p^{+}$ and $a^{+}$. That $p_i^{+}$ is true
intuitively means: (1) that $p_i^{n}$ was false (see Variant-I), and (2) that $p_i \in \mathcal{G}$, or
$p_i^{+}$ is the cause of another fluent $p_j^{+}$ in a relaxed suffix to the goal. That $a^{+}$ is true
means that $a$ is executed in the relaxed suffix. We also require a set of causal link variables.
These are best introduced in terms of a recursively defined set $S^{\infty}$ as follows.

$$S^{0} = \{\mathcal{K}(p_i, p_j) \mid a \in A^{+},\ p_i \in \mathrm{pre}(a),\ p_j \in \mathrm{eff}^{+}(a)\}$$
$$S^{i+1} = S^{i} \cup \{\mathcal{K}(p_j, p_l) \mid \mathcal{K}(p_j, p_k), \mathcal{K}(p_k, p_l) \in S^{i}\}$$

For each $\mathcal{K}(p_i, p_j) \in S^{\infty}$ we have a corresponding variable. Intuitively, if proposition
$\mathcal{K}(p_i, p_j)$ is true then $p_i$ is the cause of $p_j$ in the plan suffix.

³ In Variant-II goal constraints from Schema 1 are omitted from $\Pi_n$.
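
The recursive definition of the causal link set is a transitive closure, which the following Python sketch computes by iterating to a fixpoint. The representation of relaxed actions as `(pre, add)` pairs and the function name are assumptions made for illustration.

```python
def causal_links(relaxed_actions):
    """Compute the closed set of (cause, effect) pairs.

    relaxed_actions is an iterable of (pre, add) pairs of proposition sets.
    The initial set links every precondition of an action to every positive
    effect of that action; each round then adds the links implied by
    chaining two existing links, until nothing new appears.
    """
    links = {(p, q) for pre, add in relaxed_actions for p in pre for q in add}
    while True:
        new = {(p, r) for (p, q) in links for (q2, r) in links if q == q2} - links
        if not new:
            return links
        links |= new


# Hypothetical chain: one action turns a into b, another turns b into c.
links = causal_links([({"a"}, {"b"}), ({"b"}, {"c"})])
```

For the two-action chain, the closure adds the indirect link from `a` to `c` to the two direct links.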

Variant-II includes all schemata from Variant-I except the goal axioms of
Schema 1. We also suppose Schema 6 is now inclusive of $a^{+}$ symbols. Additionally
we have the following schemata.

7. Relaxed goal axioms: For each fluent $p \in \mathcal{G}$ we assert that it is either achieved at
the planning horizon $n$, or using a relaxed action in $A^{+}$. This is expressed with a clause:

$$p^{n} \vee p^{+}$$

8. Relaxed fluent support axioms: For each fluent $p$ we have a clause:

$$p^{+} \to \bigvee_{a \in \mathrm{make}(p)} a^{+}$$

9. Causal link axioms: For all fluents $p_1$, taking all $a_1 \in \mathrm{make}(p_1)$ and
$p_2 \in \mathrm{pre}(a_1)$, we have the following clause:

$$(p_1^{+} \wedge a_1^{+}) \to (p_2^{n} \vee \mathcal{K}(p_2^{+}, p_1^{+}))$$

This constraint asserts that if action $a_1$ is executed, then its preconditions must be true
at horizon $n$, or be supported by some other action $a_2^{+}$ with $p_2 \in \mathrm{eff}^{+}(a_2)$.

10. Causality implies cause and effect axiom: For each causal link variable
$\mathcal{K}(p_1^{+}, p_2^{+})$ we have a clause:

$$\mathcal{K}(p_1^{+}, p_2^{+}) \to (p_1^{+} \wedge p_2^{+})$$

11. Transitive closure and anti-reflexivity axioms: For causal link variable
$\mathcal{K}(p^{+}, p^{+})$ we have a unit clause containing that variable negated. For pairs of
causal link variables $(\mathcal{K}(p_1^{+}, p_2^{+}), \mathcal{K}(p_2^{+}, p_3^{+}))$:

$$(\mathcal{K}(p_1^{+}, p_2^{+}) \wedge \mathcal{K}(p_2^{+}, p_3^{+})) \to \mathcal{K}(p_1^{+}, p_3^{+})$$

12. Only necessary relaxed fluent axioms: For each fluent $p$ we have a constraint:

$$\neg p^{+} \vee \neg p^{n}$$

13. Relaxed action cost dominance axioms: Let $P$ be a set of non-mutex fluents
at horizon $n$. Relaxed action $a_1^{+}$ is redundant in an optimal solution to a Variant-II
instance, if the fluents in $P$ are true at horizon $n$ and there exists a relaxed action
$a_2^{+}$ such that: (1) $\mathrm{cost}(a_2) < \mathrm{cost}(a_1)$,
(2) $\mathrm{pre}(a_2) \setminus P \subseteq \mathrm{pre}(a_1) \setminus P$, and
(3) $\mathrm{eff}^{+}(a_1) \setminus P \subseteq \mathrm{eff}^{+}(a_2) \setminus P$. For relaxed action
$a^{+}$ that is redundant for $P_1$, and not redundant for any $P_2$ with $|P_2| < |P_1|$, we have
a clause:⁴

$$(\bigwedge_{p \in P_1} p^{n}) \to \neg a^{+}$$

⁴ In practice we limit $|P_1|$ to 2.

The schemata we have given thus far are theoretically sufficient for our purpose.
However, in a relaxed suffix most causal links are not relevant to the relaxed cost
of reaching the goal from a particular state at horizon $n$. For example, in a logistics
problem, if a truck $t$ at location $l_1$ needs to be moved directly to location $l_2$, then
the fact that the truck is at any other location should not support it being at $l_2$, i.e.
$\neg\mathcal{K}(\mathrm{at}(t, l_3), \mathrm{at}(t, l_2))$, $l_3 \neq l_1$.

The following schemata provide a number of layers that actions and fluents in the
relaxed suffix can be assigned to. Fluents and actions are forced to occur as early in the
set of layers as possible and are only assigned to a layer if all supporting actions and
fluents occur at earlier layers. The ordering of fluents in the relaxed layers is used to
restrict the truth values of the causal link variables. The admissibility of the heuristic
estimate of the relaxed suffix is independent of the number of relaxed layers.

We pick a horizon $k > n$ and generate a copy $a^{+l}$ of each relaxed action $a^{+}$ at each
layer $l \in \{n, \ldots, k-1\}$ and a copy $p^{+l}$ of each fluent $p^{+}$ at each layer
$l \in \{n+1, \ldots, k\}$. We also have an auxiliary variable $\mathrm{aux}(p^{+l})$ for each fluent
$p^{+l}$ at each suffix layer $l \in \{n+1, \ldots, k\}$. Intuitively, proposition $\mathrm{aux}(p^{+l})$
says that $p$ is false at every layer in the relaxed suffix from $n$ to $l$.

14. Layered relaxed action axioms: For each layered relaxed action $a^{+l}$ we have a
clause:

$$a^{+l} \to a^{+}$$

15. Layered relaxed actions only once axioms: For each relaxed action $a$ and pair
of layers $l_1, l_2 \in \{n, \ldots, k-1\}$, where $l_1 \neq l_2$, we have:

$$\neg a^{+l_1} \vee \neg a^{+l_2}$$

16. Layered relaxed action precondition axioms: For each layered relaxed action
$a^{+l_1}$ we have a set of clauses:

$$a^{+l_1} \to \bigwedge_{p \in \mathrm{pre}(a)} (p^{n} \vee \bigvee_{l_2 \in \{n+1, \ldots, l_1\}} p^{+l_2})$$

17. Layered relaxed action effect axioms: For each layered relaxed action $a^{+l_1}$ and
$p \in \mathrm{eff}^{+}(a)$ there is a clause:

$$(a^{+l_1} \wedge p^{+}) \to \bigvee_{l_2 \in \{n+1, \ldots, l_1+1\}} p^{+l_2}$$

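18. Layered relaxed action as early as possible axioms: For each layered relaxed
action $a^{+l_1}$, if $l_1 = n$, we have a clause:

$$a^{+} \to a^{+n} \vee \bigvee_{p \in \mathrm{pre}(a)} \neg p^{n}$$

if $l_1 > n$, we add:

$$a^{+} \to \bigvee_{l_2 \in \{n, \ldots, l_1-1\}} a^{+l_2} \vee \bigvee_{p \in \mathrm{pre}(a)} \mathrm{aux}(p^{+l_1}) \vee a^{+l_1}$$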

19. Auxiliary variable axioms: For each auxiliary variable $\mathrm{aux}(p^{+l_1})$ there is a set
of clauses:

$$\mathrm{aux}(p^{+l_1}) \leftrightarrow (\neg p^{n} \wedge \bigwedge_{l_2 \in \{n+1, \ldots, l_1\}} \neg p^{+l_2})$$

⁵ There are no cost constraints associated with the layered copies of relaxed action variables.

20. Layered fluent axioms: For each layered fluent $p^{+l}$ we add:

$$p^{+l} \to p^{+}$$

21. Layered fluent frame axioms: For each layered fluent $p^{+l}$ there is a clause:

$$p^{+l} \to \bigvee_{a \in \mathrm{make}(p)} a^{+(l-1)}$$

22. Layered fluent as early as possible axioms: For each layered fluent $p^{+l_1}$ there is
a set of clauses:

$$p^{+l_1} \to \bigwedge_{a \in \mathrm{make}(p)} \bigwedge_{l_2 \in \{n, \ldots, l_1-2\}} \neg a^{+l_2}$$

23. Layered fluent only once axioms: For each fluent $p$ and pair of layers
$l_1, l_2 \in \{n+1, \ldots, k\}$, where $l_1 \neq l_2$, there is a clause:

$$\neg p^{+l_1} \vee \neg p^{+l_2}$$

24. Layered fluents prohibit causal links axioms: For each layered fluent $p_1^{+l_1}$ and
fluent $p_2$ such that $p_1 \neq p_2$ and $\exists \mathcal{K}(p_2^{+}, p_1^{+})$ there is a clause:

$$p_1^{+l_1} \to (\bigvee_{l_2 \in \{n+1, \ldots, l_1-1\}} p_2^{+l_2} \vee \neg\mathcal{K}(p_2^{+}, p_1^{+}))$$


4  PWM-RSat 

We find that branch-and-bound procedures for partial weighted MaxSAT [11,12] are
ineffective at solving our direct encodings of bounded planning problems. Thus, taking
the RSat2.02 codebase as a starting point, we developed PWM-RSat, a more efficient
optimisation procedure for this setting. An outline of the algorithm is given in Algorithm 1.
Based on RSat [16], PWM-RSat can broadly be described as a backtracking
search with Boolean unit propagation. It features common enhancements from
state-of-the-art SAT solvers, including conflict driven clause learning with non-chronological
backtracking [17,18], and restarts [19].

Algorithm 1 outlines two variants of PWM-RSat for solving Variant-I and
Variant-II formulas: lines 5-6 will only be invoked if the input formula is a Variant-II
encoding. These lines prevent the solver from exploring assignments implying that
the same state occurs at more than one planning layer.

Apart from the above difference, the two variants of PWM-RSat work as follows.
At the beginning of the search, the current partial assignment $V$ of truth values to variables
in $\psi$ is set to empty and its associated cost $c$ is set to 0. We use $\bar{c}$ to track the best
result found so far for the minimum cost of satisfying $\psi^{\infty}$ given $\psi^{+}$. $V^{*}$ is the total
assignment associated with $\bar{c}$. Initially, $V^{*}$ is empty and $\bar{c}$ is set to an input
non-negative weight bound $c^{I}$ (if none is known then $\bar{c} = c^{I} := \infty$). Note that the set of
asserting clauses $\Gamma$ is initialised to empty as no clauses have been learnt yet.

The solver then repeatedly tries to expand the partial assignment $V$ until either the
optimal solution is found or $\psi$ is proved unsatisfiable (lines 4-21). At each iteration, a
call to $\mathrm{SatUP}(V, \psi, \kappa)$ applies unit propagation to a unit clause $\kappa \in \psi$ and adds
new variable assignments to $V$. If $\kappa$ is not a unit clause, $\mathrm{SatUP}(V, \psi, \kappa)$ returns 1
if $\kappa$ is satisfied by $V$, and 0 otherwise. The current cost $c$ is also updated (line 7). If
$c \ge \bar{c}$, then the solver will perform a backtrack-by-cost to a previous point where
$c < \bar{c}$ (lines 8-9).

Algorithm 1. Cost-Optimal RSat: PWM-RSat

 1: Input:
      - A given non-negative weight bound $c^{I}$. If none is known: $c^{I} := \infty$
      - A CNF formula $\psi$ consisting of the hard clause set $\psi^{\infty}$ and the soft clause set $\psi^{+}$
 2: $c \leftarrow 0$; $\bar{c} \leftarrow c^{I}$;
 3: $V, V^{*} \leftarrow []$; $\Gamma \leftarrow \emptyset$;
 4: while true do
 5:     if solving Variant-II && duplicating-layers($V$) then
 6:         pop elements from $V$ until $\neg$duplicating-layers($V$); continue;
 7:     $c \leftarrow \sum_{\kappa \in \psi^{+}} \omega(\kappa)\,\neg\mathrm{SatUP}(V, \psi, \kappa)$;
 8:     if $c \ge \bar{c}$ then
 9:         pop elements from $V$ until $c < \bar{c}$; continue;
10:     if $\exists \kappa \in (\psi^{\infty} \wedge \Gamma)$ s.t. $\neg\mathrm{SatUP}(V, \psi^{\infty} \wedge \Gamma, \kappa)$ then
11:         if restart then $V \leftarrow []$; continue;
12:         learn clause with assertion level $m$; add it to $\Gamma$;
13:         pop elements from $V$ until $|V| = m$;
14:         if $V = []$ then
15:             if $V^{*} \neq []$ then return $(V^{*}, \bar{c})$ as the solution
16:             else return UNSATISFIABLE;
17:     else
18:         if $V$ is total then
19:             $V^{*} \leftarrow V$; $\bar{c} \leftarrow c$;
20:             pop elements from $V$ until $c < \bar{c}$;
21:         add a new variable assignment to $V$;


During the search, if the current assignment $V$ violates any clause in $(\psi^{\infty} \wedge \Gamma)$, then
the solver will either (i) restart if required (line 11), or (ii) try to learn the conflict (line
12) and then backtrack (line 13). If the backtracking causes all assignments in $V$ to be
undone, then the solver has successfully proved that either (i) $(V^{*}, \bar{c})$ is the optimal
solution, or (ii) $\psi$ is unsatisfiable if $V^{*}$ remains empty (lines 14-16). Otherwise, if $V$
does not violate any clause in $(\psi^{\infty} \wedge \Gamma)$ (line 17), then the solver will heuristically add
a new variable assignment to $V$ (line 21) and repeat the loop in line 4. Note that if $V$ is
already complete, the better solution is stored in $V^{*}$ together with the new lower cost $\bar{c}$
(line 19). The solver also performs a backtrack by cost (line 20) before trying to expand
$V$ in line 21.
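
The cost-bounded search idea can be illustrated with a much simpler procedure than PWM-RSat. The Python sketch below is a plain depth-first search that prunes any partial assignment whose accumulated soft-clause cost already reaches the best cost found so far. It deliberately leaves out unit propagation, clause learning, restarts, and the duplicate-layer check, and it assumes (as in our encodings) that every soft clause is a unit clause. The function name and the instance are hypothetical.

```python
def pwm_search(num_vars, hard, soft_units, bound=float("inf")):
    """Minimise the total weight of violated soft unit clauses.

    hard is a list of clauses (lists of non-zero ints, -v negates variable v);
    soft_units maps a literal to the weight paid when that literal is false.
    Returns (assignment, cost) with cost < bound, or None if no such
    assignment satisfies the hard clauses.
    """
    best = [None, bound]  # best total assignment and its cost so far

    def cost(assign):
        # Weight of soft units already falsified by the partial assignment.
        return sum(w for lit, w in soft_units.items()
                   if abs(lit) in assign and assign[abs(lit)] != (lit > 0))

    def violated(assign):
        # True if some hard clause has every literal assigned and false.
        return any(all(abs(l) in assign and assign[abs(l)] != (l > 0) for l in c)
                   for c in hard)

    def extend(assign, var):
        if cost(assign) >= best[1] or violated(assign):
            return  # backtrack by cost, or backtrack on a conflict
        if var > num_vars:
            best[0], best[1] = dict(assign), cost(assign)  # improved solution
            return
        for value in (False, True):
            assign[var] = value
            extend(assign, var + 1)
            del assign[var]

    extend({}, 1)
    return None if best[0] is None else (best[0], best[1])


# Hypothetical instance: hard clause (x1 or x2); setting x1 costs 3, x2 costs 1.
result = pwm_search(2, hard=[[1, 2]], soft_units={-1: 3, -2: 1})
```

Once the assignment with cost 1 is found, the branch that sets `x1` to true is cut off immediately because its cost of 3 already reaches the bound.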


5  Experimental  Results 

We implemented both COS-P and PWM-RSat in C++. We now discuss our experimental
comparison of COS-P with the IPC baseline planner BASELINE,⁶ and a version of
COS-P called H-ORACLE. The latter is given (by an oracle) the shortest horizon that
yields a globally optimal plan. Planning benchmarks included in our evaluation are:
IPC-6: ELEVATORS, PEG SOLITAIRE, and TRANSPORT; IPC-5: STORAGE, and TPP;
IPC-3: DEPOTS, DRIVERLOG, FREECELL, ROVERS, SATELLITE, and ZENOTRAVEL; and
IPC-1: BLOCKS, GRIPPER, and MICONIC. We also developed our own domain, called
FTB, that demonstrates the effectiveness of the factored problem representations employed
by SAT-based systems such as COS-P. This domain has the following important
properties: (1) it has exponentially many states in the number of problem objects,
(2) if there are $n$ objects, then the branching factor is such that a breadth-first search
encounters all the states at depth $n$, and (3) all plans have length $n$, and plan optimality
is determined by the first and last actions (only) of the plan. This domain cripples
state-based systems such as HSP, BASELINE, and GAMER, either because they are doing
a non-factored forward heuristic search, or because (i.e., in the case of GAMER
and BASELINE) they perform a breadth-first search. Finally, experiments were run on
a cluster of AMD Opteron 252 2.6GHz processors, each with 2GB of RAM. All plans
computed by COS-P, H-ORACLE, and BASELINE were verified by the Strathclyde
Planning Group plan verifier VAL, and computed within a timeout of 30 minutes.

⁶ The de facto winning entry at the last IPC.

The results of our experiments are summarised in Table 1. For each domain there is
one row for the hardest problem instance solved by each of the three planners. Here, we
measure problem hardness as the time it takes each solver to return the optimal plan. In
some domains we also include additional instances. Using the same experimental data
as for Table 1, Figure 1 plots the cumulative number of instances solved over time by
each planning system, supposing invocations of the systems on problem instances are
made in parallel. It is important to note that the size of the CNF encodings required
by COS-P (and H-ORACLE) is not prohibitively large, i.e., where the SAT-based
approaches fail, this is typically because they exceed the 30 minute timeout, and not
because they exhaust system memory.


Fig. 1. The number of problems solved in parallel after a given planning time for each approach
(horizontal axis: planning time)

COS-P outperforms the BASELINE in the BLOCKS and FTB domains. For example,
on BLOCKS-18 BASELINE takes 39.15 seconds while COS-P takes only 3.47 seconds.
In other domains BASELINE outperforms COS-P, sometimes by several orders of magnitude.
For example, on problem ZENOTRAVEL-4 BASELINE takes 0.04 seconds while
COS-P takes 841.2. More importantly, we discovered that it is relatively easy to find
a cost-optimal solution compared to proving its optimality. For example, on MICONIC-23
COS-P took 0.53 seconds to find the optimal plan but spent 1453 seconds proving
cost-optimality. More generally, this observation is indicated by the performance of
H-ORACLE.

Overall, we find that clause learning procedures in PWM-RSat cannot exploit the
presence of the very effective delete relaxation heuristic from $\Pi^{+}$. Consequently, a
serious bottleneck of our approach comes from the time required to solve Variant-II
instances. On a positive note, those proofs are possible, and in domains such as BLOCKS
and FTB, where the branching factor is high and useful plans long, the factored problem
representations and corresponding solution procedures in the SAT-based setting pay off.
Moreover, in fixed-horizon cost-optimal planning, the SAT approach continues to show
good performance characteristics in many domains.

Table 1. $C^{*}$ is the optimal cost for each problem. All times are in seconds. For BASELINE, $t$ is
the solution time. For H-ORACLE, $n$ is the horizon returned by the oracle and $t$ is the time taken
to find the lowest cost plan at $n$. For COS-P, $t_t$ is the total time for all SAT instances, $t_\pi$ is the
total time for all SAT instances where the system was searching for a plan, while $t_*$ is the total
time for all SAT instances where the system is performing optimality proofs. '-' indicates that a
solver either timed out or ran out of memory.

                      BASELINE   H-ORACLE        COS-P
Problem         C*    t          n     t         n     t_t      t_π      t_*
…               …     0          12    0.08      12    1.63     0.23     1.41
pegsol-9        5     0.02       15    7.07      15    416.6    12.25    404.4
pegsol-13       9     0.14       21    1025      -     -        -        -
pegsol-26       9     42.44      -     -         -     -        -        -
rovers-3        11    0.02       8     0.1       8     53.21    0.08     53.13
rovers-5        22    164.1      8     69.83     -     -        -        -
satellite-1     9     0          8     0.08      8     0.92     0.1      0.82
satellite-2     13    0.01       12    0.23      -     -        -        -
satellite-4     17    6.61       -     -         -     -        -        -
storage-7       14    0          14    0.45      14    1.16     1.16     0
storage-9       11    0.2        9     643.2     -     -        -        -
storage-13      18    3.47       18    112.1     18    262.8    262.8    0
storage-14      19    60.19      -     -         -     -        -        -
TPP-5           19    0.15       7     0.01      -     -        -        -
transport-1     54    0          5     0.02      5     0.27     0.03     0.24
transport-4     318   47.47      -     -         -     -        -        -
transport-23    630   0.92       9     1.28      -     -        -        -
zenotravel-4    8     0.04       7     1.07      7     843.7    2.47     841.2
zenotravel-6    11    8.77       7     54.35     -     -        -        -
zenotravel-7    15    5.21       8     1600      -     -        -        -

6 Concluding Remarks

In this paper we demonstrate that a general theorem-proving technique, particularly
a DPLL procedure for Boolean SAT, can be modified to find cost-optimal solutions
to propositional planning problems encoded as SAT. In particular, we modified SAT
solver RSat2.02 to create PWM-RSat, an effective partial weighted MaxSAT procedure
for problems where all soft constraints are unit clauses. This forms the underlying
optimisation procedure in COS-P, our cost-optimal planning system that, for successive
horizon lengths, uses PWM-RSat to establish a candidate solution at that horizon,
and then to determine if that candidate is globally optimal. Each candidate is a minimal
cost step-bounded plan for the problem at hand. That a candidate is globally optimal is
known if no step-bounded plan with a relaxed suffix has lower cost. To achieve that, we
developed a MaxSAT encoding of bounded planning problems with a relaxed suffix.
This constitutes the first application of causal representations of planning in propositional
logic [15].

Existing work directly related to COS-P includes the hybrid solver CO-Plan [20]
and the fixed-horizon optimal system Plan-A. Those systems placed 4th and last
respectively out of 10 systems at IPC-6. CO-Plan is hybrid in the sense that it proceeds
in two phases, each of which applies a different search technique. The first phase is
SAT-based, and identifies the least costly step-optimal plan. Plan-A also performs
that computation, but assumes that a least cost step-optimal plan is globally
optimal. Therefore Plan-A was not competitive because it could not find globally
optimal solutions, and thus forfeited in many domains. The first phase of CO-Plan and
the Plan-A system can be seen as more general and efficient versions of the system
described in [21]. The second phase of CO-Plan breaks from the planning-as-SAT
paradigm. It corresponds to a cost-bounded anytime best-first search. The cost bound
for the second phase is provided by the first phase. Although competitive with a number
of other competition entries, CO-Plan is not competitive in IPC-6 competition benchmarks
with the BASELINE, the de facto winning entry, a brute-force A* in which the
distance-plus-cost computation always takes the distance to be zero.

Other work related to COS-P leverages SAT modulo theory (SMT) procedures to
solve problems with metric resource constraints [22]. SMT solvers typically interleave
calls to a simplex algorithm with the decision steps of a backtracking search, such as
DPLL. Solvers in this category include the systems LPSAT [22], TM-LPSAT [23],
and NumReach/SMT [1]. SMT-based planners also operate according to the BLACKBOX
scheme, posing a series of step-bounded decision problems to an SMT solver until
an optimal plan is achieved. Because they are not globally optimal, existing SMT systems
are not directly comparable to COS-P.

The most pressing item for future work is a technique to exploit SMT (and/or
branch-and-bound procedures from weighted MaxSAT) in proving the optimality of
candidate solutions that PWM-RSat yields in bounded instances. We should also exploit
recent work on admissible heuristics for state-based search when evaluating
whether horizon $n$ yields an optimal solution [24].


This was supposed to be possible, although in a very impractical sense (final remarks of [4]).

Abstract. We show that the node cumulative influence for a particular class of
information diffusion model in which a node can be activated multiple times, i.e.
the Susceptible/Infective/Susceptible (SIS) model, can be very efficiently estimated
in the case of the independent cascade (IC) framework with asynchronous time delay.
The method exploits the property of continuous time delay within a stochastic
framework and analytically derives the iterative formula to estimate cumulative
influence without relying on very lengthy simulations. We show that it can
accurately estimate the cumulative influence with much less computation time
(about 2 to 6 orders of magnitude less) than the naive simulation using three
real world social networks and thus it can be used to rank influential nodes quite
effectively. Further, we show that the SIS model with a discrete time step, i.e.
fixed synchronous time delay, gives adequate results only for a small time span.


1  Introduction 

The proliferation of emails, blogs and social networking services (SNS) on the World
Wide Web has accelerated the creation of large social networks [1,2,3,4,5]. Social
networks naturally mediate the spread of various information. Innovation, topics and even
malicious rumors can propagate in the form of so-called “word-of-mouth” communications.
Thus, it is now understood that social networks provide rich sources of information
that help us understand the dynamics of our society, e.g. who are the best
group of people to spread the desired information, how people respond to other people's
opinions, what kinds of topics propagate faster, how public opinions are formed, how
the way information spreads differs from community to community, etc.

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 244-255, 2010.
Several models have been proposed that simulate information diffusion through a
network. The most widely used model is the independent cascade (IC) model. This is a
fundamental probabilistic model of information diffusion [6,7], which can be regarded as
the so-called susceptible/infective/recovered (SIR) model for the spread of a disease [2].
This model has been used to solve such problems as the influence maximization problem,
which is to find a limited number of nodes that are influential for the spread of
information [7,8], and the influence minimization problem, which is to suppress the spread
of undesirable information by blocking a limited number of links [9]. Here, the influence
of a node is defined as the expected number of nodes that it can activate; an expectation
is needed because of the stochastic nature of the information diffusion. The SIR model
assumes that a node, once infected, is never re-infected after it has been cured
(recovered). Thus, the influence is normally defined as the expected number of recovered
nodes at the end of the time span in consideration. The other class of model for the
spread of a disease is the so-called susceptible/infective/susceptible (SIS) model [2],
where a node, once infected, moves back to a susceptible state and can be re-activated
multiple times. A similar problem can be solved for this model, too [10,11]. For these
models, efficient methods of estimating the influence have been proposed based on bond
percolation, strongly connected component decomposition, burnout and pruning [8,11],
but no analytical solutions have been found. Thus, the efficiency gain remains limited:
the computation is 2 or 3 orders of magnitude faster than naive simulation.

The IC model above, whether it is used in the SIR or the SIS setting, cannot handle
time delays that are asynchronous and continuous for information propagation. The time
step is incremented discretely and thus the node states are updated synchronously, which
can be viewed as the time delay being fixed and synchronous. We call this “fixed time
delay” for short. In reality, time flows continuously and thus information, too, propagates
on this continuous time axis. For any node, it must be possible to receive information at
any time from any other node and to propagate it to yet other nodes at any other
time with a possible delay, both in an asynchronous way. We call this “continuous time
delay” for short. For example, the following scenario in the SIS setting explains
this need. Suppose a person A posted an article to a blog and a person B read it and
responded a week later. Another person C posted an article on the same topic the day
after A posted, and B read it and responded the same day. B was activated twice, first by
C and next by A, although A was activated earlier than C. Thus, for a realistic
behavior analysis of information diffusion, we need to adopt a model that explicitly
represents continuous asynchronous time delay. The continuous time delay SIR model
was discussed in the machine learning problem setting in which the objective was to
learn the parameters in the diffusion model from the observed time-stamped node
activation sequence data [3,12]. In [12] it was shown that the parameters can be learned by
maximizing the likelihood of the observed data being produced by the model. Note that
there is no need to do simulation to obtain the influence degree in the SIR setting,
because the final influence degree is equal to that of the model without time delay¹ since
a node is not allowed to be re-activated multiple times.

¹ This is equivalent to fixed time delay in the discrete-time setting.

In this paper, we address the problem of efficiently estimating the cumulative influence
of a node in the network by adopting the information diffusion model that allows
continuous time delay and multiple activation of the same node under the framework of
the independent cascade model, called CTSIS for short. Interestingly, although the model
we consider in this paper is the most complicated among the series of models
discussed above, it is possible to derive a formula analytically, under a simplified
condition, that can iteratively estimate the cumulative influence of a node by exploiting the
property of continuous time delay within a stochastic framework. What makes the
analysis easier is that in the case of continuous time there is only one single node that can
be activated at a time, i.e. there are no multiple activations at different nodes at the same
time, and no simultaneous activations of a node by its multiple active parents, each of
which has been activated at a different time in the past. Thus it does not make sense to
define the node influence at a specific time, and in light of SIS and continuous time delay we
naturally define the influence to be an integral over a specified time span (cumulative
influence), which is more meaningful in many practical settings.

We show that the proposed method (called the iterative method) can accurately
estimate the cumulative influence with much less computation time (about 2 to 6 orders
of magnitude less) than the empirical mean of the naive simulation method with a limited
number of runs, using three real-world social networks with different sizes and
connectivities. The method can be used to rank influential nodes quite effectively. We compare
the proposed method with two other methods, the SIS with fixed time delay and the
one which is the extreme case of the proposed method where the time span is set to be
infinitely large (called the infinite iterative method). We show that these are indeed less
accurate and discuss under which conditions they work well, e.g. SIS with fixed time
delay only works well for a small time span.

The  paper  is  organized  as  follows.  We  revisit  the  information  diffusion  model,  in 
particular  SIS  family,  in  section  2,  and  explain  the  proposed  method  of  cumulative 
influence  estimation  in  section  3.  Then  we  report  the  experimental  results  in  section  4, 
followed  by  discussion  in  section  5.  We  summarize  our  conclusion  in  section  6. 

2  Information  Diffusion  Model 

Let $G = (V, E)$ be a directed network, where $V$ and $E$ ($\subset V \times V$) stand for the
sets of all the nodes and (directed) links, respectively. For any $v \in V$, let $\Gamma(v; G)$
denote the set of the child nodes (directed neighbors) of $v$, that is,

$$\Gamma(v; G) = \{ w \in V ;\ (v, w) \in E \}.$$

We  consider  information  diffusion  models  on  G  in  the  susceptible/infected/susceptible 
(SIS)  framework.  In  this  context,  infected  nodes  mean  that  they  have  just  adopted  the 
information,  and  we  call  these  infected  nodes  active  nodes. 

2.1  Basic  SIS  Model 

We first define the basic SIS model for information diffusion on $G$. In the model, the
diffusion process unfolds in discrete time-steps $t \ge 0$, and it is assumed that the state
of a node is either active or inactive. For every link $(u, v) \in E$, we specify a real value
$k_{u,v}$ with $0 < k_{u,v} < 1$ in advance. Here, $k_{u,v}$ is referred to as the diffusion
parameter through link $(u, v)$. Given an initial active node $v_0$ and a time span $T$, the
diffusion process proceeds in the following way. Suppose that node $u$ becomes active at
time-step $t$ ($< T$). Then, node $u$ attempts to activate every $v \in \Gamma(u; G)$, and succeeds
with probability $k_{u,v}$. If node $u$ succeeds, then node $v$ will become active at time-step
$t + 1$. Thus, as mentioned in section 1, we can view this as synchronous fixed time delay².
If multiple active nodes attempt to activate node $v$ in time-step $t$, then their activation
attempts are sequenced in an arbitrary order. On the other hand, node $u$ will become
inactive at time-step $t + 1$ unless it is activated by an active node in time-step $t$. The
process terminates if the current time-step reaches the final time $T$.
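
The process above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the function name `simulate_basic_sis`, the adjacency-dictionary argument `children`, and the use of a single uniform diffusion parameter `k` are all hypothetical choices, and the returned count excludes the initial activation of the start node.

```python
import random


def simulate_basic_sis(children, k, v0, T, rng):
    """One run of the discrete-time SIS model (illustrative sketch).

    children: dict mapping a node to a list of its child nodes (assumed format)
    k:        uniform diffusion parameter (assumption: same value on every link)
    v0:       initial active node, active at time-step 0
    T:        final time-step
    Returns the number of activations in time-steps 1..T.
    """
    active = {v0}  # nodes that are active at the current time-step
    total = 0      # activations counted so far (initial node excluded)
    for _ in range(T):
        newly_active = set()
        for u in active:
            for v in children.get(u, ()):
                # every active parent makes one independent attempt per step
                if rng.random() < k:
                    newly_active.add(v)  # simultaneous successes count once
        total += len(newly_active)
        active = newly_active  # a node stays active only if re-activated
        if not active:
            break
    return total


if __name__ == "__main__":
    graph = {0: [1], 1: [0]}  # two nodes linked in both directions
    print(simulate_basic_sis(graph, 0.5, 0, 10, random.Random(0)))
```

Using a set for the newly activated nodes reflects the point made in the text that several parents succeeding on the same child in one time-step yield a single activation.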

2.2  Continuous-Time  SIS  Model 

Next, we extend the basic SIS model so as to allow continuous-time delays, and refer to
the extended model as the continuous-time SIS (CTSIS) model³. This model can be
interpreted as a susceptible/exposed/infective/susceptible (SEIS) model in that a node does
not become active (infected) instantly when activated, but waits for a while (exposed)
before it gets activated (infected). Once it gets activated, it instantly turns into the
susceptible state. In terms of information diffusion of some topic in blog space, this activation
corresponds to posting a blog article on the topic (an instantaneous action).

In the CTSIS model on $G$, for each link $(u, v) \in E$, we specify real values $r_{u,v}$ and
$k_{u,v}$ with $r_{u,v} > 0$ and $0 < k_{u,v} < 1$ in advance. We refer to $r_{u,v}$ and $k_{u,v}$ as the
time-delay parameter and the diffusion parameter through link $(u, v)$, respectively.

Let $T$ be the time span. The diffusion process unfolds in continuous time $t$, and
proceeds from a given initial active node $v_0$ in the following way. Suppose that a node
$u$ becomes active at time $t$ ($< T$). Then a delay-time $\delta$ is chosen for each of $u$'s child
nodes $v \in \Gamma(u; G)$ from the exponential distribution with parameter $r_{u,v}$. If
$t + \delta < T$, $v$ is activated by $u$ with success probability $k_{u,v}$ at time $t + \delta$. Under the
continuous-time framework, there is no possibility that multiple parent nodes of $v$
simultaneously activate $v$ at exactly the same time $t + \delta$. The process terminates if the
current time reaches the final time $T$.
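
The following is a minimal event-queue sketch of one run of this process, together with an empirical mean over repeated runs. It assumes uniform parameters `k` and `r` on every link and treats `r` as the rate of the exponential distribution (mean delay `1/r`); the names `simulate_ctsis`, `estimate_influence` and the `children` dictionary are hypothetical, and the initial activation at time 0 is not counted.

```python
import heapq
import random


def simulate_ctsis(children, k, r, v0, T, rng):
    """One run of the CTSIS model (illustrative sketch, uniform k and r).

    Returns the number of activations that occur after time 0 and before T.
    """
    events = [(0.0, v0)]  # pending activations as (time, node), ordered by time
    count = -1            # so that the initial activation is not counted
    while events:
        t, u = heapq.heappop(events)
        count += 1
        for v in children.get(u, ()):
            delta = rng.expovariate(r)  # delay drawn with rate r
            # the attempt lands inside the time span and succeeds with prob. k
            if t + delta < T and rng.random() < k:
                heapq.heappush(events, (t + delta, v))
    return count


def estimate_influence(children, k, r, v0, T, M, seed=0):
    """Empirical mean over M independent runs."""
    rng = random.Random(seed)
    runs = (simulate_ctsis(children, k, r, v0, T, rng) for _ in range(M))
    return sum(runs) / M


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0], 2: [0]}
    print(estimate_influence(graph, 0.3, 1.0, 0, 10.0, M=1000))
```

Because an activated node returns to the susceptible state immediately, every scheduled activation is processed regardless of the node's history; the heap only keeps the events in chronological order.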

2.3  Influence  Function 

Let $T$ be the time span for the CTSIS model on $G$. We consider a time-interval $[T_0, T_1]$
with $0 \le T_0 < T_1 \le T$. For any node $v \in V$, let $S(v; T_0, T_1)$ denote the total number
of nodes activated within time-interval $[T_0, T_1]$ for the probabilistic diffusion process
from an initial active node $v$ under the CTSIS model. Note that $S(v; T_0, T_1)$ is a random
variable. Let $\sigma(v; T_0, T_1)$ denote the expected value of $S(v; T_0, T_1)$. We call $\sigma(v; T_0, T_1)$
the cumulative influence degree of node $v$ within time-interval $[T_0, T_1]$. Note that $\sigma$ is
a function defined on $V$. We call the function $\sigma(\cdot; T_0, T_1) : V \to \mathbb{R}$ the cumulative
influence function for the CTSIS model within time-interval $[T_0, T_1]$ on network $G$.

It is important to estimate the cumulative influence function $\sigma(\cdot; T_0, T_1)$ efficiently.
In theory we can simply estimate it by simulating the CTSIS model in the following
way. First, a sufficiently large positive integer $M$ is specified. For each $v \in V$, the
diffusion process of the CTSIS model is simulated from initial active node $v$, and the
total number of nodes activated within time-interval $[T_0, T_1]$, $S(v; T_0, T_1)$, is calculated.
Then, $\sigma(v; T_0, T_1)$ is estimated as the empirical mean of the values of $S(v; T_0, T_1)$ obtained
from $M$ such simulations. We refer to this estimation method as the naive simulation
method. However, as shown in the experiments, this is extremely inefficient and is not
practical. In this paper, we deal with the case “$T_0 = 0$, $T_1 = T$” for
simplicity, and we denote $\sigma(v; 0, T)$ by $\sigma(v; T)$.

² This may well be called “no time delay” because time delay is not explicitly represented in
the formulation.

³ Note that the information propagates at a certain time point, but its delay can be continuous.

3  Estimation  Methods 

For a given directed graph $G = (V, E)$, we identify each node with a unique integer from
1 to $|V|$. Then we can define the adjacency matrix $A \in \{0, 1\}^{|V| \times |V|}$ by setting
$a_{u,v} = 1$ if $(u, v) \in E$; otherwise $a_{u,v} = 0$. We also define the probability matrix
$P \in [0, 1]^{|V| \times |V|}$ by replacing each element $a_{u,v}$ with the corresponding diffusion
probability $k_{u,v}$ if $(u, v) \in E$. Let $f_v \in \{0, 1\}^{|V|}$ be a vector whose $v$-th element is 1
and whose other elements are 0, and let $\mathbf{1} \in \{1\}^{|V|}$ be a vector whose elements are all 1.

3.1  Infinite  Iterative  Method 

We can calculate the number of nodes that are reachable in $J$ steps starting from a
node $v$ by $f_v^{\top} A^J \mathbf{1}$. Thus, when considering the diffusion probabilities, we can calculate
the vector of the expected number of reachable nodes starting from each node within
$J$ steps by $P\mathbf{1} + \cdots + P^J \mathbf{1}$. Therefore, in the case that the time-interval is $[0, \infty]$,
according to the definition of the CTSIS model, we obtain the cumulative influence degree
$\sigma_\infty$ as follows:

$$\sigma_\infty = \sum_{J=1}^{\infty} P^J \mathbf{1}. \qquad (1)$$

Note that the vector $\sigma_\infty$ consists of values of the cumulative influence function, i.e.,
$\sigma(\cdot; \infty)$. We refer to this estimation method as the infinite iterative method.

However, there exist some intrinsic limitations to this simple iterative method, i.e.,
we cannot specify an arbitrary time-interval $[T_0, T_1]$ or arbitrary diffusion probabilities for
this method. As for the diffusion probabilities, when the largest eigenvalue of the
probability matrix $P$ is less than 1, we are guaranteed to obtain a finite value of $\sigma_\infty$. In the
simple case that the diffusion parameters are uniform for any link, i.e., $k_{u,v} = k$ for any
$(u, v) \in E$, since the probability matrix $P$ is equivalent to $kA$, the diffusion parameter $k$
must be less than the reciprocal of the largest eigenvalue of the adjacency matrix $A$.
Incidentally, the calculation formula for this simple case is quite similar to that of Bonacich’s
centrality [13] and identical to that of Katz’s measure [14].
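
As an illustration of Eq. 1 and of the convergence condition, the series can be accumulated by repeated matrix-vector products. The sketch below is written in plain Python with a dense list-of-lists matrix for readability; the function name `infinite_iterative`, the tolerance, and the iteration cap are assumptions made for this example and are not taken from the paper.

```python
def infinite_iterative(P, tol=1e-12, max_iter=100000):
    """Accumulate sigma_inf = sum_{J>=1} P^J 1 (illustrative sketch).

    P is a square list-of-lists probability matrix. The series is truncated
    once every entry of P^J 1 drops below tol; if that never happens within
    max_iter steps, the largest eigenvalue of P is probably not below 1.
    """
    n = len(P)
    x = [1.0] * n      # x holds P^J 1, starting from the all-ones vector
    sigma = [0.0] * n
    for _ in range(max_iter):
        x = [sum(P[u][v] * x[v] for v in range(n)) for u in range(n)]
        sigma = [s + xi for s, xi in zip(sigma, x)]
        if max(x) < tol:
            return sigma
    raise ValueError("series did not converge; largest eigenvalue may be >= 1")


if __name__ == "__main__":
    k = 0.5                   # below 1/eig(A) = 1 for this two-node graph
    P = [[0.0, k], [k, 0.0]]  # P = kA for two nodes linked in both directions
    print(infinite_iterative(P))
```

When the series converges it equals $(I - P)^{-1} P \mathbf{1}$, so a linear solve is an alternative to the truncated sum; the iterative form is shown here because it mirrors the step-by-step structure used later for the proposed method.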

3.2  Proposed  Method 

We want to estimate the cumulative influence degree within time-interval $[T_0, T_1]$ for
arbitrary diffusion probabilities. To this end, we introduce the probability $R(J; T_0, T_1)$
that diffusion takes $J$ steps within this time-interval according to the CTSIS model.
Here, in order to simplify our derivation, we focus on the simplest case that the
time-delay parameters are uniform for any link, i.e., $r_{u,v} = r$ for any $(u, v) \in E$, although
our approach can be naturally extended to more complex settings. In the special case
where $T_0 = 0$ and $T_1 = T$, we denote this probability by $R(J; T)$. Here we note that
$R(J; T_0, T_1) = R(J; T_1) - R(J; T_0)$. Thus we focus on the calculation of $R(J; T)$.

Let $\delta_j$ be a random variable of the time-delay for the $j$-th step ($1 \le j \le J$). In order to
meet the condition that the diffusion takes $J$ steps within time-interval $[0, T]$, the total
sum of the time-delays must be less than $T$, i.e., $0 < \delta_1 + \cdots + \delta_J < T$. In the case of
$J = 1$, we can easily obtain the following formula.

$$R(1; T) = \int_0^T r \exp(-r\delta_1)\, d\delta_1 = 1 - \exp(-rT). \qquad (2)$$

In the case of $J \ge 2$, due to the independence of time-delay trials, we can calculate the
probability $R(J; T)$ as follows:

$$R(J; T) = \int_0^T \int_0^{T-\delta_1} \cdots \int_0^{T-(\delta_1+\cdots+\delta_{J-1})} \prod_{j=1}^{J} r \exp(-r\delta_j)\, d\delta_J \cdots d\delta_1. \qquad (3)$$

Here, by noting the following two formulas,

$$\int_0^{T-(\delta_1+\cdots+\delta_{J-1})} r \exp(-r\delta_J)\, d\delta_J = 1 - \exp(-rT) \prod_{j=1}^{J-1} \exp(r\delta_j),$$

$$\int_0^T \cdots \int_0^{T-(\delta_1+\cdots+\delta_{J-2})} r^{J-1} \exp(-rT)\, d\delta_{J-1} \cdots d\delta_1 = \exp(-rT) \frac{(rT)^{J-1}}{(J-1)!},$$

we can calculate Eq. 3 as follows:

$$R(J; T) = R(J-1; T) - \exp(-rT) \frac{(rT)^{J-1}}{(J-1)!}. \qquad (4)$$

Therefore, from Eqs. 2 and 4, we can derive the following explicit formula:

$$R(J; T) = 1 - \exp(-rT) \sum_{j=1}^{J} \frac{(rT)^{j-1}}{(j-1)!}. \qquad (5)$$

Here, we can easily see that $R(J; T)$ is a monotonically decreasing function approaching
zero as $J$ increases.
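
The explicit formula for $R(J; T)$ is straightforward to evaluate numerically. The sketch below, with hypothetical function names `R` and `R_interval`, accumulates the partial sum term by term; it is meant only to make the formula concrete and is numerically naive for very large values of $rT$, where the terms of the sum can overflow.

```python
import math


def R(J, T, r=1.0):
    """Probability that the sum of J i.i.d. exponential delays (rate r) is below T.

    Evaluates 1 - exp(-rT) * sum_{j=1..J} (rT)^(j-1) / (j-1)!  (illustrative sketch).
    """
    term = 1.0     # (rT)^(j-1) / (j-1)! for j = 1
    partial = 0.0  # running value of the sum
    for j in range(1, J + 1):
        partial += term
        term *= r * T / j  # advance to (rT)^j / j!
    return 1.0 - math.exp(-r * T) * partial


def R_interval(J, T0, T1, r=1.0):
    """Probability that the J-th step falls within [T0, T1]."""
    return R(J, T1, r) - R(J, T0, r)


if __name__ == "__main__":
    for J in range(1, 6):
        print(J, R(J, 2.0), R_interval(J, 1.0, 2.0))
```

For a fixed $T$ the printed values of `R(J, 2.0)` decrease with `J`, in line with the monotonicity remark above.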

Now, by combining Eqs. 1 and 5, we can derive a new method for estimating the
cumulative influence degree within time-interval $[T_0, T_1]$ for arbitrary diffusion
probabilities. We can formulate the key formula as follows:

$$\sigma_{[T_0, T_1]} = \sum_{J=1}^{\infty} R(J; T_0, T_1)\, P^J \mathbf{1}. \qquad (6)$$
Below we summarize the algorithm of the proposed method.

1. Set each element of $\sigma_{[T_0, T_1]}$ to 0, and set $J \leftarrow 1$ and $x \leftarrow \mathbf{1}$.

2. Calculate $x \leftarrow Px$ and if $R(J; T_0, T_1)\,\|x\| < \eta$, then output $\sigma_{[T_0, T_1]}$ and terminate.

3. Set $\sigma_{[T_0, T_1]} \leftarrow \sigma_{[T_0, T_1]} + R(J; T_0, T_1)\,x$ and $J \leftarrow J + 1$, and return to 2.

In this algorithm, $x \in \mathbb{R}^{|V|}$ is a vector used to calculate the expected number of the $J$-step
reachable nodes, and $\eta$ is a parameter for the termination condition. In our experiments,
$\eta$ is set to a sufficiently small number, i.e., $10^{-12}$.
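
The three steps can be sketched as follows for the case $T_0 = 0$, $T_1 = T$ with uniform parameters. This is an illustrative sketch rather than the authors' code: the function name `cumulative_influence`, the adjacency-dictionary input, the use of the maximum norm for $\|x\|$, and the `max_steps` safeguard are all assumptions introduced here.

```python
import math


def cumulative_influence(children, k, r, T, eta=1e-12, max_steps=10000):
    """Iterative estimate of sigma(v; T) for every node v (illustrative sketch).

    children: dict mapping a node to a list of its child nodes (assumed format)
    k, r:     uniform diffusion and time-delay parameters (assumption)
    The norm in the stopping rule is taken to be the maximum norm (assumption).
    """
    nodes = set(children) | {v for vs in children.values() for v in vs}
    x = {u: 1.0 for u in nodes}      # x = P^J 1, starting from the all-ones vector
    sigma = {u: 0.0 for u in nodes}  # step 1: sigma <- 0
    decay = math.exp(-r * T)
    term, partial = 1.0, 0.0         # term = (rT)^(J-1)/(J-1)!, partial = its running sum
    for J in range(1, max_steps + 1):
        # step 2: x <- P x, with P = kA given as adjacency lists
        x = {u: k * sum(x[v] for v in children.get(u, ())) for u in nodes}
        partial += term
        term *= r * T / J
        R = 1.0 - decay * partial    # R(J; T)
        if R * max(x.values()) < eta:
            return sigma
        # step 3: sigma <- sigma + R(J; T) x
        for u in nodes:
            sigma[u] += R * x[u]
    return sigma


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0], 2: [0]}
    print(cumulative_influence(graph, 0.3, 1.0, 10.0))
```

Each pass costs one sparse matrix-vector product, so the work grows with the number of links times the number of steps needed before the stopping rule fires. The sketch is restricted to $T_0 = 0$ because $R(J; T_0, T_1)$ with $T_0 > 0$ can be very small for the first few $J$, which would trigger the stopping rule prematurely.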


4  Experiments 

We first evaluate the performance (accuracy) of the proposed method (iterative method)
by comparing it with the naive simulation method with different numbers of runs used to
estimate the empirical mean, using three large real social networks. We then compare the
iterative method with two other methods, the infinite iterative method and the SIS with
fixed time delay method, in terms of the estimated cumulative influence degree for the
CTSIS model using the same networks. Finally we compare the efficiency (computation
time) of the iterative method with the naive simulation method. In all the experiments,
we consider the simplest case where both the diffusion and time-delay parameters of
the CTSIS model are uniform for any link.

4.1  Datasets 

We employed three datasets of large real networks. These are all bidirectionally
connected networks. The first one is a network of people that was derived from the “list of
people” within Japanese Wikipedia, also used in [15], and has 9,481 nodes and 245,044
directed links (the Wikipedia network). The second one is a network derived from the
Enron Email Dataset [16] by extracting the senders and the recipients and linking those
that had bidirectional communications, and has 4,254 nodes and 44,314 directed links
(the Enron network). The third one is a coauthorship network used in [17] and has
12,357 nodes and 38,896 directed links (the coauthorship network).

4.2  Accuracy  Evaluation 

We evaluated the accuracy of the proposed method by comparing it with the naive
simulation method mentioned in section 2.3. We speculate that the cumulative influence
degree estimated by taking the empirical mean of the results of the naive simulation
method converges asymptotically to the true value as the number of simulations M
increases. Thus, we first examined how the difference of the estimated cumulative
influence degree between the iterative method and the naive simulation method changes
as M changes for the three networks.

The difference was evaluated by

$$\mathcal{E}_M = \sum_{v \in V} |\sigma(v; T) - \sigma_M(v; T)|, \qquad (7)$$

where $\sigma(v; T)$ and $\sigma_M(v; T)$ are the cumulative influence degree of node $v$ estimated by
the iterative method and the naive simulation method, respectively. We used a time span⁴
of T = 10 and set M to 100, 1,000, and 10,000.

In these experiments we determined the values for the diffusion and time-delay
parameters as follows. As noted in 3.1, the diffusion parameter $k$ must
be less than $\mathrm{eig}(A)^{-1}$, the reciprocal of the largest eigenvalue of the adjacency matrix
$A$ of the network, for the infinite iterative method to obtain a finite value of $\sigma_\infty$. The
values of $\mathrm{eig}(A)^{-1}$ for the Wikipedia, Enron, and Coauthorship networks were 0.00674,
0.0205, and 0.105, respectively. Thus, we adopted 0.0067, 0.02, and 0.1 as the values of
$k$ for these networks, respectively. These are the largest values that the infinite iterative
method can take. We set $r = 1$ for the time-delay parameter. This is equivalent to setting
the average time delay to be a unit time, which is consistent with the discrete time step of
the SIS with fixed time delay method.

Table 1 summarizes the results, from which we can see that the estimation difference
decreases as M increases and it becomes reasonably small at M = 10,000 for all the three
networks. We are able to verify our speculation and conclude that the proposed iterative
method can indeed estimate the cumulative influence accurately.

4.3  Cumulative  Influence  Degree  Comparison 

Next, we investigated how well the other approaches can approximate the cumulative
influence degree. We compared two approaches. One is the infinite iterative method
described in 3.1. The other is the SIS with fixed time delay method [11]⁵. The SIS with fixed
time delay method uses bond percolation on the layered graph, which is constructed from
the original social network with each layer added on top as the time proceeds [10], and
estimates the cumulative influence degree much more efficiently than the naive
simulation method. We used the same M (= 10,000) from the result in 4.2. For each network,
we investigated two cases, one with a short time span T = 10 and the other with a
long time span T = 100. Note that we set r = 1 and thus the average time delay is 1.
We selected the top 200 most influential nodes that the iterative method identified and
compared their cumulative influence degree with the values that the other two methods
estimated for the same 200 nodes.

Figure 1 illustrates the results of the comparison. We can see that the infinite iterative
method estimates the cumulative influence degree fairly well for a long time span T =
100 except for the Wikipedia network, but it tends to overestimate it for a short time
span T = 10. In contrast, the SIS with fixed time delay method tends to underestimate
the cumulative influence degree for a long time span T = 100 but it does well for a
short time span T = 10. These results show that these two methods cannot correctly
estimate the cumulative influence degree for an arbitrary time span.

⁴ We had to set the value to be small so that the naive simulation returns the result within a
day.

⁵ Note that in [11] the influence degree was defined to be the expected number of active nodes
at the end of observation time T, but here the algorithm in [11] is modified to calculate the
cumulative influence degree.

Table 1. Estimation difference between the iterative method (proposed) and the naive
simulation method

network         M = 100    M = 1,000    M = 10,000
Wikipedia       0.196      0.062        0.020
Enron           0.552      0.190        0.062
Coauthorship    0.298      0.096        0.031

Fig. 1. Comparison in cumulative influence degrees of top 200 influential nodes. Panels
(horizontal axis: rank): (a) Wikipedia (T = 10), (b) Enron (T = 10), (c) Coauthor (T = 10),
(d) Wikipedia (T = 100), (e) Enron (T = 100), (f) Coauthor (T = 100).

It is noted that there are many bumps in the graphs for the cases where the estimation
of the other two methods is very poor, i.e. T = 10 for the infinite iterative method and
T = 100 for the SIS with fixed time delay method. This implies that the ranking results
by these methods are different from the true ranking by the iterative method. The curves
become smoother when the estimation becomes better.


4.4  Efficiency  Evaluation 

We saw in 4.3 that neither the infinite iterative method nor the SIS with fixed time delay
method accurately estimates the cumulative influence degree, so here we compare the
computation time of the iterative method with the naive simulation method for M = 1. The
results are shown in Fig. 2 for three values of the time span T = 10, 20, 100 and for each
of the three networks. Three values are chosen for k. The minimum values are the same
as the ones used in 4.2 and 4.3, and the other values are obtained by multiplying by 1.5 in
sequence. The iterative method returns the values in less than 0.5 sec. for all cases and is
very insensitive to the parameter values. The naive simulation method is only efficient
when k is very small and requires exponentially increasing time as k increases.
Indeed, it did not return the values within 3 days in many cases. Considering that this is
for a single simulation, use of the naive simulation method is not practical.
5  Discussions

We mentioned in 3.1 that the cumulative influence degree derived by the infinite iterative
method is similar to the centrality proposed by Bonacich [13] and identical to Katz’s
measure [14]. In [13] the standard centrality $e_u$ of node $u$ is defined by

$$\lambda e_u = \sum_{v \in V} a_{u,v} e_v, \qquad (8)$$

where $\lambda$ is a constant introduced to ensure a non-zero solution, and $A$ is the adjacency
matrix ($a_{u,v}$ is its element) as before. Bonacich generalized Eq. 8 by introducing the
strength of relationship $\beta$, which is equivalent to $k$ in this paper, and derived the
generalized centrality $c_u(\alpha, \beta)$ as

$$c_u(\alpha, \beta) = \sum_{v \in V} (\alpha + \beta c_v(\alpha, \beta))\, a_{u,v}, \qquad (9)$$

where $\alpha$ is a normalization constant. It is easily shown that $c(\alpha, \beta)$ is written in matrix
notation as

$$c(\alpha, \beta) = \alpha \sum_{j=0}^{\infty} \beta^j A^{j+1} \mathbf{1} = \alpha (A\mathbf{1} + \beta A^2 \mathbf{1} + \beta^2 A^3 \mathbf{1} + \cdots). \qquad (10)$$

Comparing Eq. 1 with Eq. 10, we note that they are the same except that the generalized
centrality assumes that the strength of relationship with the directly connected nodes is
1. Further, we note that the following equality holds:

$$\sigma_\infty = \frac{\beta}{\alpha}\, c(\alpha, \beta), \qquad (11)$$

which is exactly the same as Katz’s measure. Thus, the cumulative influence degree $\sigma_\infty$
defined by the infinite iterative method is interpreted as a centrality measure.
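
As a short worked check of this relation, assume uniform diffusion parameters so that $P = \beta A$ (with $k = \beta$) and assume that $\beta$ is below the reciprocal of the largest eigenvalue of $A$ so that the series converge. Re-indexing the series of Eq. 1 with $j = J - 1$ then gives the series of Eq. 10 up to a constant factor:

```latex
\begin{align*}
\sigma_\infty
  &= \sum_{J=1}^{\infty} P^J \mathbf{1}
   = \sum_{J=1}^{\infty} \beta^J A^J \mathbf{1} \\
  &= \beta \sum_{j=0}^{\infty} \beta^{j} A^{j+1} \mathbf{1}
   \qquad (j = J - 1) \\
  &= \frac{\beta}{\alpha}\, c(\alpha, \beta).
\end{align*}
```

The factor $\beta$ that is pulled out in the second line corresponds to the remark that the generalized centrality weights directly connected nodes by 1 rather than by $\beta$.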

We showed in 4.3 that the infinite iterative method approximates the cumulative
influence degree well when the time span is large. This is evident because the infinite iterative
method assumes an infinite time span. In the extreme limit of T = ∞, the iterative
method converges to the infinite iterative method. How large T should be in order for
it to be large depends on the delay time parameter r. When r gets smaller, a smaller T
can be called large, e.g. T = 10 is large when r = 0.1. A similar argument can be made
for the SIS with fixed time delay method. The SIS with fixed time delay method advances
the time in discrete steps. Thus, it can happen that multiple parents attempt to activate the
same node simultaneously. If this happens, the activation count is only
incremented by one. When the time span T is small, the diffusion does not
go far and there is not much chance that this simultaneous activation happens. This is
why the SIS with fixed time delay method gives good results for a small time span T.
However, how well the SIS with fixed time delay method approximates the cumulative
influence degree depends on how close the time step is to the average delay-time δ.
It overestimates the true cumulative influence degree for T = 10 when r = 0.1 and
underestimates it when r = 10. We confirmed this by additional experiments, but due to
the space limit we do not show the figures.


6  Conclusion 

In this paper we addressed the problem of efficiently estimating the cumulative influence
degree of a node in social networks when the information diffusion follows the
Susceptible/Infective/Susceptible (SIS) model with asynchronous continuous time delay
based on the independent cascade framework. It is possible to analytically derive a
formula by which to iteratively calculate the cumulative influence degree to a desired
accuracy. The simplified version, which corresponds to assuming an infinitely large time
span, is closely related to the generalized centrality measure. We showed by applying
the method to three large real world social networks that the method can accurately
estimate the cumulative influence degree with 2 to 6 orders of magnitude less computation
time than the naive simulation method. Thus, it can be used to rank the influential nodes
very efficiently. We also compared the proposed iterative method to the SIS with fixed
time delay model and the infinite iterative method and confirmed that they generally
produce poor estimates and only give good results when a specific condition holds for
each

Abstract. In this paper, we describe two heuristics for the Single Vehicle
Loading Problem (SVLP) which can handle practical constraints
that are frequently encountered in the freight transportation industry,
such as the servicing order of clients, item fragility, and the stability
of the goods. The two heuristics, Deepest-Bottom-Left-Fill and Maximum
Touching Area, are 3D extensions of natural heuristics that have
previously only been applied to 2D packing problems. We employ these
heuristics as part of a two-phase tabu search algorithm for the
Three-Dimensional Loading Capacitated Vehicle Routing Problem (3L-CVRP),
where the task is to serve all customers using a homogeneous fleet of
vehicles at minimum traveling cost. The resultant algorithm produces
mostly superior solutions to existing approaches, and appears to scale
better with problem size.

Keywords: vehicle routing, 3D packing, Deepest-Bottom-Left-Fill, Maximum
Touching Area, Tabu Search.


1  Introduction 

The Three-Dimensional Loading Capacitated Vehicle Routing Problem (3L-CVRP)
was first introduced by [1] and subsequently studied by [2]. The task
is to plan the routes for a fleet of homogeneous vehicles that delivers items to
customers, such that the total distance traveled by all vehicles is minimized. In
addition, the three-dimensional loading plan for each vehicle must be formulated
that fulfills constraints addressing issues such as the stability and fragility of the
items and the convenience of loading and unloading.

When searching for a solution for 3L-CVRP, the Single Vehicle Loading (sub-)
Problem (SVLP) must be solved multiple times. Our primary contribution in this
paper is the introduction of two heuristics for the SVLP, namely the
Deepest-Bottom-Left-Fill (DBLF) and the Maximum Touching Area (MTA) heuristics.
These heuristics are used in a two-phase tabu search algorithm: the first phase
attempts to make an infeasible initial solution feasible, and the second phase
attempts to improve the quality of the solution.

We compared our DBLF+MTA tabu search (DMTS) algorithm with the Tabu
Search (TS) algorithm employed by [1] and the Ant Colony Optimization (ACO)
algorithm designed by [2] using a standard set of 27 test cases. Our experiments
show that DMTS outperforms TS in all cases, and produces a superior solution
to ACO for 22 out of the 27 cases. Furthermore, the running time for DMTS
does not dramatically increase with problem size, unlike TS and ACO, and it
converges to good solutions rapidly.

2  Related  Work 

The  many  variants  of  the  Capacitated  Vehicle  Routing  Problem  (CVRP)  have 
two  common  aspects,  namely  routing  and  loading.  The  goal  of  routing  is  to 
determine  a  sequence  that  visits  all  customers  with  minimum  total  travel  cost. 
The  goal  of  loading  is  to  find  a  loading  plan  for  each  vehicle  that  satisfies  all 
loading  constraints.  Instead  of  considering  the  actual  packing  of  each  item,  a 
scalar value is usually used to represent the volume of each item; as long as the
total  volume  of  the  items  loaded  does  not  exceed  the  vehicle’s  capacity,  it  is 
assumed  that  loading  is  possible  [3]. 

[1] was the first to consider the vehicle routing problem with three-dimensional
loading constraints (3L-CVRP). A tabu search approach was proposed to address
the 3L-CVRP, where the three-dimensional loading sub-problem was also solved
by a tabu search metaheuristic. [2] employed a local search to solve the loading
sub-problem along with an ant colony optimization routine to find an overall
solution to the 3L-CVRP problem.

There are approaches for the 2D loading problem that can potentially be
adapted to 3L-CVRP, for example [4,5], but it is unclear how such approaches
can handle practical constraints.

The SVLP is related to container loading problems. Various practical constraints
on the supporting surface or item fragility are often considered [6,7,8].
To date, the best results on problems of reasonable size are held by
heuristic-based methods [9,10,11,12].

3  Problem  Description 

Let $G = (V, E)$ be an undirected graph, where $V = \{0, 1, 2, \ldots, n\}$ is the set of
$n + 1$ vertices corresponding to a depot, represented by vertex 0, and $n$ clients,
denoted by vertices $1, \ldots, n$; and $E$ is the set of edges. The cost of an edge $(i, j)$ is
denoted by $c_{ij}$. There are $v$ identical vehicles available; each vehicle has a weight
capacity $D$ and a three-dimensional rectangular loading space $S = W \times H \times L$
defined by width $W$, height $H$, and length $L$. Each client $i$ ($i = 1, \ldots, n$) requires
the delivery of a set of $m_i$ three-dimensional items $I_{ik}$ ($k = 1, 2, \ldots, m_i$), having
width $w_{ik}$, height $h_{ik}$ and length $l_{ik}$, with total weight $d_i$.

In  3L-CVRP,  we  assume  all  items  are  rectangular  boxes.  The  items  can  only 
be  placed  orthogonally  inside  a  vehicle;  however,  items  can  be  rotated  by  90°  on 
the  width-length  plane.  Some  items  are  also  marked  fragile. 

The objective of 3L-CVRP is to find a set of at most $v$ routes (one per vehicle)
such that

(1) Every vehicle starts from the depot, visits a sequence of clients and returns
to the depot;

(2) All clients are served by exactly one vehicle;

(3) No vehicle carries a total weight that exceeds its capacity;

(4) All items for a particular vehicle can be orthogonally packed while satisfying
the following loading constraints:

(4.a) (Fragility Constraint) no non-fragile items are placed on top of fragile
items;

(4.b) (Supporting Area Constraint) all items have a supporting area of at least
a percent of their base area;

(4.c) (LIFO constraint) all items fulfill the LIFO policy, i.e., when client $i$ is
visited, all of its corresponding items $I_{ik}$ must not be stacked beneath
nor be blocked by items of later clients. An item is considered blocked
if it will overlap any item of a later client when it is moved along the L
axis towards the door.

(5) The total cost of all edges in the routes is minimized.
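The pairwise nature of constraints (4.a) and (4.c) can be illustrated with a minimal Python sketch. The `Box` record and the function names are hypothetical, introduced only for this example. The sketch assumes that "stacked beneath" covers any later client's item lying above with an overlapping footprint, and that coordinates follow the axis convention used in this paper (x along the width, y along the height, z along the length, door at the largest z).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    x: float       # left face (width axis)
    y: float       # bottom face (height axis)
    z: float       # back face (length axis; the door is at the largest z)
    w: float
    h: float
    l: float
    client: int    # position of the owning client in the visiting order
    fragile: bool

def _overlap(a0, a1, b0, b1):
    """Length of the overlap of the intervals (a0, a1) and (b0, b1)."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def rests_on(top, bottom):
    """True if `top` sits directly on `bottom` with a shared footprint."""
    return (top.y == bottom.y + bottom.h
            and _overlap(top.x, top.x + top.w, bottom.x, bottom.x + bottom.w) > 0
            and _overlap(top.z, top.z + top.l, bottom.z, bottom.z + bottom.l) > 0)

def violates_fragility(top, bottom):
    """Constraint (4.a): a non-fragile item may not rest on a fragile one."""
    return rests_on(top, bottom) and bottom.fragile and not top.fragile

def violates_lifo(early, late):
    """Constraint (4.c) for an item `early` that is unloaded before `late`."""
    if late.client <= early.client:
        return False
    share_x = _overlap(early.x, early.x + early.w, late.x, late.x + late.w) > 0
    share_y = _overlap(early.y, early.y + early.h, late.y, late.y + late.h) > 0
    share_z = _overlap(early.z, early.z + early.l, late.z, late.z + late.l) > 0
    # Assumption: any later item above the footprint counts as "stacked on".
    stacked_beneath = share_x and share_z and late.y >= early.y + early.h
    # Blocked: sliding `early` towards the door would hit `late`.
    blocked = share_x and share_y and late.z >= early.z + early.l
    return stacked_beneath or blocked
```

Both checks take constant time per pair of items, which is consistent with the later remark that these constraints can be tested without increasing the complexity of the loading heuristics.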

In this study, we use the following Cartesian coordinate system. The loading
space of the vehicle is in the first octant and the origin is at the deepest,
bottommost, leftmost corner. The width $W$, the height $H$, and the length $L$ are
parallel to the x-, y-, and z-axis respectively. The terms left, right, top,
bottom, back, and front are self-explanatory. The vehicle has a single door, which
is located at the front (i.e., at $z = L$).

4  Procedure  for  the  SVLP 

In order to solve the 3L-CVRP, we must solve the SVLP, defined as follows.
Given a list of clients to be visited in a fixed order, devise a loading plan for
all items for a particular vehicle that satisfies all the loading constraints. We
can do this by treating the vehicle as an open-ended bin (i.e., a vehicle of infinite
length), finding the minimum length needed to accommodate all the items, and then
comparing this value with the length of the vehicle.

Our SVLP procedure makes use of a routine LoadAll. Given an ordered list
of items L, LoadAll returns the minimum length required to load all items into
the vehicle. Assuming the existence of LoadAll, we can use Algorithm 1 to solve
the SVLP (we set K = 150 in our experiments).

Algorithm 1. Local Search for the SVLP

input: L: list of items
if total volume or weight of items in L exceeds capacity of vehicle then return failure;
Sort L by reverse visiting order, then non-fragile first, then by descending volume;
for k = 1 to K do
    len ← LoadAll(L);
    if len ≤ length of vehicle then return success;
    Randomly swap items i and j in L, i ≠ j;
return failure;

The LoadAll procedure uses two heuristics, namely Deepest-Bottom-Left-Fill
(DBLF) and Maximum Touching Area (MTA). Both heuristics attempt to produce
a plan that minimizes the length required to load all items; the LoadAll
routine invokes both heuristics and returns the smaller result.
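The local search of Algorithm 1 and the "smaller of two heuristics" rule for LoadAll can be sketched as follows. This is a minimal illustration, not the authors' C++ code: the dictionary keys, `solve_svlp`, and `load_all_from` are hypothetical names, and the heuristics are passed in as plain callables that map an ordered item list to the length it needs.

```python
import random

def load_all_from(heuristics):
    """Build a LoadAll routine that returns the smallest heuristic result."""
    return lambda order: min(h(order) for h in heuristics)

def solve_svlp(items, vehicle, load_all, max_iters=150, rng=None):
    """Return True if some ordering of `items` loads within the vehicle length.

    items: list of dicts with keys 'visit', 'fragile', 'w', 'h', 'l', 'weight'.
    vehicle: dict with 'W', 'H', 'L' and the weight capacity 'D'.
    load_all: callable mapping an ordered item list to the length it needs.
    """
    rng = rng or random.Random()
    volume = sum(it["w"] * it["h"] * it["l"] for it in items)
    weight = sum(it["weight"] for it in items)
    if volume > vehicle["W"] * vehicle["H"] * vehicle["L"] or weight > vehicle["D"]:
        return False
    # Reverse visiting order, then non-fragile first, then descending volume.
    order = sorted(items, key=lambda it: (-it["visit"], it["fragile"],
                                          -it["w"] * it["h"] * it["l"]))
    for _ in range(max_iters):
        if load_all(order) <= vehicle["L"]:
            return True
        if len(order) < 2:
            break
        # Perturb the loading order by swapping two distinct positions.
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return False
```

The default of 150 iterations mirrors the value of K reported above; the early capacity check avoids calling the loading heuristics on instances that cannot fit.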

5  The  DBLF  Heuristic 

The Deepest-Bottom-Left-Fill (DBLF) heuristic loads items one at a time by
placing the current item in the deepest, bottommost, leftmost position. It is
an adaptation of the Bottom-Left-Fill (BLF) heuristic for 2D packing [13] into
three dimensions.

For any packing pattern, there is an equivalent pattern, called a normal pattern,
where each item is pushed as close to the origin as possible. Correspondingly,
given an existing packing pattern, all positions that an item can occupy following
this rule are called normal positions. It is therefore sufficient when following the
DBLF rule to only try normal positions when placing an item, i.e., the positions
where the back face of item i touches either the front face of some already packed
item or the vehicle; the left side of item i touches either the right side of some
already packed item or the vehicle; and the bottom of item i touches either the
top of some already packed item or the vehicle.

In this section, we present two versions of the DBLF heuristic. The first version
does not include the supporting area constraint, but does handle the other
constraints (fragility and LIFO); the second version contains the modifications
necessary to handle the supporting area constraint. Both versions run in $O(n^4)$
time, where $n$ is the number of items per vehicle.

5.1  DBLF  without  Supporting  Area  Constraint 

An item $i$ is said to be placed at $x_i$ (similarly for $y_i$, or $z_i$) if $x_i$ is the x-coordinate
of the left (or bottom or back) face of item $i$. Let $w_i$ (respectively $h_i$ or $l_i$)
be the magnitude of the projection of item $i$ onto the x-axis; the x-coordinate
of the right (or top or front) face is then given by $x_i + w_i$.

Assuming $i - 1$ items have been loaded, consider the $i$-th item. If $y_i$ and $z_i$
are fixed, then enumerating all candidates for $x_i$ and checking the feasibility of
candidate positions $(x_i, y_i, z_i)$ for non-overlapping can be done in a single pass by
sliding the item from left to right (see Algorithm 2). To do so, we maintain three
lists of loaded items $L_{left}$, $L_{top}$, $L_{back}$ sorted in ascending order, where $L_{left}$ is
sorted by x-coordinate of the left face; $L_{top}$ is sorted by y-coordinate of the top
face; and $L_{back}$ is sorted by z-coordinate of the back face. Note that in line 2
we introduced two dummy items Bottom and Back of dimensions $L \times W \times 0$
and $0 \times W \times H$ respectively; we initialize $L_{top}$ and $L_{back}$ to contain Bottom and
Back respectively. This avoids having to check the boundary conditions of the
vehicle itself.


Algorithm 2. DBLF without Supporting Area

    input: List: list of m items
 1  Initialize L_left with no items;
 2  Initialize L_top, L_back with Bottom and Back;
 3  for every item i in List do
 4      best position pos ← (∞, ∞, ∞);
 5      best orientation (w, h, l) ← (∞, ∞, ∞);
 6      for each orientation w_i, h_i, l_i of item i do
 7          for each item j in L_back do
 8              z_i ← z_j + l_j;
 9              for each item k in L_top s.t. y_k + h_k + h_i ≤ H do
10                  y_i ← y_k + h_k;
11                  p ← 0, x_i ← 0;
12                  while p < size of L_left and x_i + w_i ≤ W do
13                      q ← the p-th item in L_left;
14                      if x_i + w_i ≤ x_q then found x_i and break;
15                      if item i overlaps with item q then x_i ← x_q + w_q;
16                      p ← p + 1;
17                  if x_i + w_i ≤ W then
18                      found feasible position (x_i, y_i, z_i);
19                      update pos and (w, h, l) if (x_i, y_i, z_i) is better;
20                      continue next orientation (line 6);
21      place item i at pos;
22      insert item i into L_left, L_top, L_back;
23  return the largest z_i + l_i;


Both the fragility and LIFO constraints can be handled without any increase
in the time complexity by amending line 15 so that it also checks, in constant
time, whether placing item $i$ at $x_i$ violates the fragility and LIFO constraints
with respect to item $q$. Algorithm 2 runs in $O(n^4)$ time.
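The deepest-bottom-left placement rule over normal positions can be illustrated with a deliberately simplified Python sketch. It is not Algorithm 2: it enumerates candidate coordinates by brute force instead of the single-pass slide over sorted lists, so it is slower than $O(n^4)$, and it ignores rotations, fragility, LIFO and supporting area. The function names and the tuple layout `(x, y, z, w, h, l)` are assumptions made for the example.

```python
def overlaps(a, b):
    """True if boxes a and b, given as (x, y, z, w, h, l), share interior volume."""
    return all(a[k] < b[k] + b[k + 3] and b[k] < a[k] + a[k + 3] for k in range(3))

def dblf_place(sizes, W, H):
    """Place boxes of size (w, h, l) in list order; return (placements, length)."""
    placed = []
    for (w, h, l) in sizes:
        # Candidate coordinates: a vehicle wall, or a face of a placed box.
        zs = sorted({0} | {b[2] + b[5] for b in placed})
        ys = sorted({0} | {b[1] + b[4] for b in placed})
        xs = sorted({0} | {b[0] + b[3] for b in placed})
        spot = None
        for z in zs:                      # deepest first
            for y in ys:                  # then bottommost
                if y + h > H:
                    continue
                for x in xs:              # then leftmost
                    if x + w > W:
                        continue
                    cand = (x, y, z, w, h, l)
                    if not any(overlaps(cand, b) for b in placed):
                        spot = cand
                        break
                if spot:
                    break
            if spot:
                break
        if spot is None:
            raise ValueError("item does not fit the vehicle cross-section")
        placed.append(spot)
    # Length of the open-ended bin actually used by the loading plan.
    return placed, max((b[2] + b[5] for b in placed), default=0)
```

For example, five unit cubes in a vehicle with a 2 x 2 cross-section fill the first layer with four cubes and push the fifth to z = 1, giving a required length of 2.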
5.2 DBLF with Supporting Area Constraint

With the supporting area constraint, it is not sufficient to try only normal
positions to honour the DBL rule. Consider a normal position that is feasible except
that the supporting area is just below a%. If we move the item slightly to the
right or the front before reaching the next normal position, the supporting area
may now be sufficient.

Assume that the first $i - 1$ items have been loaded. Let $A(x, y, z)$ be the total
supporting area contributed by the loaded items to item $i$ when it is placed at
$(x, y, z)$.

We first consider the case where $y$ and $z$ are fixed. Let $A_q(x, y, z)$ be the
supporting area contributed by item $q$. Let $B_i$ be the bottom face of item $i$; $T_q$
be the top face of item $q$; and intervals $I_{zi}$ and $I_{zq}$ be the projections of $B_i$ and
$T_q$ on the z-axis respectively. Obviously, $A_q(x, y, z) = 0$ if $y \neq y_q + h_q$ (i.e., the
bottom of $i$ is not at the same level as the top of $q$) or $I_{zi} \cap I_{zq} = \emptyset$. Otherwise,
let $s_{iq}$ be a line segment on the z-axis corresponding to $I_{zi} \cap I_{zq}$.

We can decompose $A_q(x, y, z)$ into four parts:

(1) $A_{q1}(x, y, z)$ is the area swept by $s_{iq}$ from the left side of $T_q$ to the right side
of $B_i$ if $x_q < x + w_i$; 0 otherwise.

(2) $A_{q2}(x, y, z)$ is the area swept by $s_{iq}$ from the left side of $T_q$ to the left side
of $B_i$ if $x_q < x$; 0 otherwise.

(3) $A_{q3}(x, y, z)$ is the area swept by $s_{iq}$ from the right side of $T_q$ to the right
side of $B_i$ if $x_q + w_q < x + w_i$; 0 otherwise.

(4) $A_{q4}(x, y, z)$ is the area swept by $s_{iq}$ from the right side of $T_q$ to the left side
of $B_i$ if $x_q + w_q < x$; 0 otherwise.

Observe that $A_q(x, y, z) = A_{q1}(x, y, z) - A_{q2}(x, y, z) - A_{q3}(x, y, z) + A_{q4}(x, y, z)$
for any position of $i$ and $q$. Let $A_r(x, y, z)$, $r = 1, 2, 3, 4$ be the sum of $A_{qr}(x, y, z)$
over all items $q$; then $A(x, y, z) = A_1(x, y, z) - A_2(x, y, z) - A_3(x, y, z) + A_4(x, y, z)$.
This is a useful observation because $A_r(x, y, z)$ can be easily computed.
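The four-term decomposition can be checked numerically along the x-axis. In the sketch below (function names are hypothetical), each term is the positive part of a linear function of $x$, matching the four cases listed above; multiplying the result by the length of the segment $s_{iq}$ gives the supporting area contributed by item $q$.

```python
def pos(t):
    """Positive part of t."""
    return max(0.0, t)

def overlap_direct(x, w_i, x_q, w_q):
    """Length of the x-overlap between item i at x and item q at x_q."""
    return pos(min(x + w_i, x_q + w_q) - max(x, x_q))

def overlap_four_terms(x, w_i, x_q, w_q):
    """The same length written as A_q1 - A_q2 - A_q3 + A_q4 (per unit of s_iq)."""
    a1 = pos(x + w_i - x_q)          # left side of T_q to right side of B_i
    a2 = pos(x - x_q)                # left side of T_q to left side of B_i
    a3 = pos(x + w_i - x_q - w_q)    # right side of T_q to right side of B_i
    a4 = pos(x - x_q - w_q)          # right side of T_q to left side of B_i
    return a1 - a2 - a3 + a4
```

Each term is zero up to a single threshold in $x$ and linear afterwards, which is what makes the per-term sums simple to maintain as item $i$ slides.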

For item $q$, we call $x^*_{q1} = x_q - w_i$; $x^*_{q2} = x_q$; $x^*_{q3} = x_q + w_q - w_i$; and
$x^*_{q4} = x_q + w_q$ the event points of $A_1$, $A_2$, $A_3$ and $A_4$ respectively. This is because
when item $i$ slides from left to right, after the event point of $A_r$, item $q$ starts
to contribute to $A_r$.

Let $X_r$ be the set of all event points of $A_r$ sorted in ascending order. $A_r(x, y, z)$
is a piecewise linear function of $x$ with local maxima achieved at $x \in X_r$. Note
that the slope between two consecutive event points is constant. Let $SZ_r(x, y, z)$
be the slope of $A_r$ at $(x, y, z)$; since the contribution by item $q$ to $A_r$ changes
only at the event points, $SZ_r(x, y, z)$ is a step function of $x$.

Let $X$ be the set of all event points in $X_r$, $r = 1, 2, 3, 4$ sorted in ascending
order. $A(x, y, z)$ is a piecewise linear function of $x$ with maxima achieved at some
$x \in X$.

When item $i$ slides along the z-axis with $x$, $y$ fixed, we can also define the event
points $Z_r$ for $A_r$ and the slope function $SX_r(x, y, z)$. Using similar arguments,
we find that $A_r(x, y, z)$ is a piecewise linear function of $z$ with local maxima
achieved at $z \in Z_r$; $SX_r(x, y, z)$ is a step function of $z$; and if $Z$ is the set of all
event points in $Z_r$ sorted in ascending order, then $A(x, y, z)$ is a piecewise linear
function with local maxima at $z \in Z$. Hence, Lemma 1 holds:

Lemma 1. When $y$ is fixed, the local maxima of $A(x, y, z)$ are at some $(x, z) \in
X \times Z$, where $X$ and $Z$ are event points.

Once the values of $A_r(x, y, z)$ and $SZ_r(x, y, z)$ at all pairs $(x, z)$ in $X \times Z$ are
known, then $A(x, y, z)$ can be computed.

Let $NO(x, y, z)$; $F(x, y, z)$; and $LIFO(x, y, z)$ be indicator functions where
1 indicates that placing an item at $(x, y, z)$ will satisfy the non-overlapping;
fragility; and LIFO constraint with respect to all other items, respectively, and
0 otherwise. These three functions can be efficiently computed; we will use
$NO(x, y, z)$ for illustration.

We can divide the base of the vehicle into $|X| \times |Z|$ grid squares using lines
parallel to the z- and x-axis that pass through points in $X$ and $Z$ respectively.
The following lemma is readily verified.

Lemma 2. $NO(x_1, y, z_1) = NO(x_2, y, z_2)$ if $(x_1, z_1)$ and $(x_2, z_2)$ are in the
interior of the same grid. The same is true for $F(x, y, z)$ and $LIFO(x, y, z)$.


Fig. 1. Illustration of proof of Theorem 1

Theorem 1. If $(x_i, y_i, z_i)$ is the best DBL position for item $i$ that satisfies the
non-overlapping, fragility, and LIFO constraints, then either $x_i \in X$ or $z_i \in Z$.

Proof. Assume on the contrary, i.e., the best position $a = (x_i, y_i, z_i)$ is inside a
grid square (Figure 1). By Lemma 2, any position (in particular b, c, d) in the
same square will also satisfy the non-overlapping, fragility, and LIFO constraints.

Suppose we slide item $i$ leftwards along the line $z = z_i$. The supporting area
$A(x, y_i, z_i)$ is a linear function of $x$. If the slope is not positive, then position
b will have a larger or the same supporting area as a. Since b is feasible, this
contradicts the fact that $(x_i, y_i, z_i)$ is the leftmost position; hence, the slope must
be positive.

Now suppose we slide item $i$ along the line $z = z_i$ by $\delta$ towards the right to
position c. Since the slope is positive, the supporting area at position c is strictly
greater than a%. At this point, we can slide the item along $x = x_i + \delta$ by $\epsilon$. Since
$A(x_i + \delta, y_i, z)$ is a linear function of $z$, we can find a position d with supporting
area greater than or equal to that at a. Since position d is deeper than $(x_i, y_i, z_i)$, a
does not respect the DBL rule. Hence, either $x_i \in X$ or $z_i \in Z$.

Let $cap(x, X) = \min\{r > x : r \in X\}$ be a function that returns the smallest
number in $X$ that is greater than $x$. Using a similar argument as Lemma 2:

Lemma 3. If $(x, y, z)$ is the best DBL position for item $i$ and satisfies all loading
constraints, then $(cap(x, X), y, cap(z, Z))$ is a feasible position.

Theorem 1 and Lemma 3 imply that there exists a best position for item $i$,
$(x_i, y, z_i)$, that is either the result of pushing item $i$ from location $(x, y, z)$ along
the z-axis to its deepest position or along the x-axis to its leftmost position
in the grid square. Consequently we only need to search for feasible positions
$(x, y, z)$, $\forall (x, z) \in X \times Z$.

In order to find the best DBL position, we first scan all possible $z \in Z$, then
$y \in Y$ in ascending order. For each $z$, $y$ we examine all $x \in X$. If a feasible
position is found with supporting area larger than a%, then it is possible that
item $i$ can be pushed deeper and/or to the left. Let $cup(z, Z)$ be the largest
element smaller than $z$ in $Z$; we can push the item deeper only if at the position
$(x, y, cup(z, Z))$, all indicators NO, F, and LIFO are 1, or to the left if all
indicators are 1 at the position $(cup(x, X), y, z)$.
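When the event points are kept in a sorted list, the two lookups $cap$ and $cup$ reduce to binary searches. The sketch below uses Python's standard `bisect` module; returning `None` when no such element exists is a convention chosen for this example, not something specified in the text.

```python
from bisect import bisect_left, bisect_right

def cap(x, points):
    """Smallest element of the sorted list `points` strictly greater than x."""
    k = bisect_right(points, x)
    return points[k] if k < len(points) else None

def cup(x, points):
    """Largest element of the sorted list `points` strictly smaller than x."""
    k = bisect_left(points, x)
    return points[k - 1] if k > 0 else None
```

With `X = [0, 2, 5, 9]`, `cap(3, X)` gives 5 and `cup(3, X)` gives 2, which are the grid lines to the right and left of a position at x = 3.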

The values of $A_r$; $Z$; all indicator functions; and the supporting area can
be computed in linear time. Pushing item $i$ deeper or to the left can be done
in constant time (since $A(x, y, z)$ is a linear function of $z$ and $x$). The time
complexity to load a single item is therefore $O(n^3)$; hence, to load all $n$ items,
the total time complexity is $O(n^4)$.

6  Maximum  Touching  Area  Heuristic 

The Maximum Touching Area (MTA) heuristic places items into the vehicle at
the position that maximizes the total contact area of its faces with the faces of
other items or with the vehicle. It is an extension of the Maximum Touching
Perimeter heuristic for 2D packing [14] into three dimensions.

Let $A_{left}(x, y, z)$; $A_{right}(x, y, z)$; $A_{front}(x, y, z)$; $A_{back}(x, y, z)$; $A_{top}(x, y, z)$;
and $A_{bottom}(x, y, z)$ be the contact area of the left; right; front; back; top; and
bottom faces if the current item is placed at $(x, y, z)$, respectively. The total
touching area $A^*(x, y, z)$ of the current item placed at $(x, y, z)$ is the sum of
these six functions.
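The total touching area can be computed by summing, face by face, the rectangle shared with each neighbouring box and with the vehicle walls. The following is a minimal sketch under stated assumptions: boxes are tuples `(x, y, z, w, h, l)`, coordinates are exact (e.g. integers) so that equality tests detect touching faces, the ceiling counts as a vehicle wall, and the open front end does not. The function names are hypothetical.

```python
def _ov(a0, a1, b0, b1):
    """Length of the overlap of the intervals (a0, a1) and (b0, b1)."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def contact_area(a, b):
    """Area shared by touching faces of boxes a and b, given as (x, y, z, w, h, l)."""
    total = 0.0
    for k in range(3):
        i, j = (k + 1) % 3, (k + 2) % 3   # the two axes spanning the face
        touching = a[k] + a[k + 3] == b[k] or b[k] + b[k + 3] == a[k]
        if touching:
            total += (_ov(a[i], a[i] + a[i + 3], b[i], b[i] + b[i + 3])
                      * _ov(a[j], a[j] + a[j + 3], b[j], b[j] + b[j + 3]))
    return total

def touching_area(item, placed, W, H):
    """Total contact of `item` with the placed boxes and with the vehicle."""
    x, y, z, w, h, l = item
    walls = 0.0
    walls += h * l * ((x == 0) + (x + w == W))   # left and right walls
    walls += w * l * ((y == 0) + (y + h == H))   # floor and ceiling
    walls += w * h * (z == 0)                    # back wall
    return walls + sum(contact_area(item, b) for b in placed)
```

A unit cube in the back-bottom-left corner of an empty 2 x 2 vehicle touches three walls, for a total of 3; adding a second cube to its right raises the first cube's total to 4.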

Theorem 2. Given a current loading pattern, there exists a position $(x, y, z)$,
$x \in X$, $y \in Y$, $z \in Z$ to place the current item such that $A^*(x, y, z)$ is maximal.

Proof. Consider $A^*(x, y, z)$ for arbitrary fixed $y$, $z$. Observe that $A_{front}(x, y, z)$;
$A_{back}(x, y, z)$; $A_{top}(x, y, z)$; and $A_{bottom}(x, y, z)$ are all piecewise linear functions
of $x$ with extreme points at $x \in X$. The sum of these four functions is also
a piecewise linear function of $x$ with extreme points at $x \in X$. Furthermore,
$A_{left}(x, y, z)$ and $A_{right}(x, y, z)$ can only be non-zero if $x \in X$. Therefore, the
local maxima for $A^*(x, y, z)$ with arbitrary fixed $y$, $z$ must be when $x \in X$.

Without loss of generality, the same applies when considering $A^*(x, y, z)$ for
arbitrary fixed $x$, $y$ or arbitrary fixed $x$, $z$. Therefore, the global maximum for
$A^*(x, y, z)$ must be at some position $(x, y, z)$ where $x \in X$, $y \in Y$, $z \in Z$.

We can adapt the DBLF algorithm to find the position that maximizes total
contact area. Aside from the packing rule, there are two differences between
MTA and DBLF: 1) for MTA we must scan all $(x, y, z) \in X \times Y \times Z$, whereas
in DBLF we can stop as soon as a feasible position has been found; and 2) for
MTA we need not push the current item along the grid lines, unlike for DBLF.


7  A  Two-Phase  Tabu  Search  for  the  3L-CVRP 


We  adapted  the  savings  algorithm  for  CVRP  [15]  to  construct  our  initial  solution. 
Starting  with  one  client  per  route,  we  iteratively  merge  two  routes  with  the 
largest  traveling  time  savings  until  no  more  merging  can  be  done.  If  after  the 
first  round  of  merging  there  are  more  routes  than  vehicles,  then  we  perform 
another  round  of  merging,  where  we  allow  both  the  total  weight  of  items  and 
the  length  of  loading  space  to  exceed  vehicle  capacity;  the  second  round  ends 
when  the  number  of  routes  and  vehicles  are  equal. 

We employed five neighbourhood operators, namely:

— 2-opt: select a pair of clients $(i, j)$ from a route; the order of all clients
between $i$ and $j$ inclusive is reversed.

— 2-swap: select a pair of clients from a route with at least 3 clients; the order
of the selected pair is swapped.

— move: select a client from route $R_i$ and an insertion point from route $R_j$, $j \neq i$;
the client is deleted from $R_i$ and inserted into $R_j$ at the insertion point.

— crossover: select a splitting point from each of two routes $R_i$ and $R_j$ (with
at least two clients); the prefix sequences of the two routes are exchanged.

— splitting: select a splitting point from a route $R_i$ (with at least two clients);
$R_i$ is split into two routes at that point.
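The five operators above can be sketched as small pure functions on routes. This is an illustrative sketch only: routes are assumed to be Python lists of client ids without the depot, positions are zero-based indices, and the function names are hypothetical.

```python
def two_opt(route, i, j):
    """Reverse the clients in positions i..j (inclusive) of one route."""
    return route[:i] + route[i:j + 1][::-1] + route[j + 1:]

def two_swap(route, i, j):
    """Exchange the clients in positions i and j of one route."""
    out = list(route)
    out[i], out[j] = out[j], out[i]
    return out

def move(src, k, dst, pos):
    """Remove the k-th client of `src` and insert it into `dst` at `pos`."""
    return src[:k] + src[k + 1:], dst[:pos] + [src[k]] + dst[pos:]

def crossover(r1, cut1, r2, cut2):
    """Exchange the prefixes r1[:cut1] and r2[:cut2] of two routes."""
    return r2[:cut2] + r1[cut1:], r1[:cut1] + r2[cut2:]

def splitting(route, cut):
    """Split one route into two at position `cut`."""
    return route[:cut], route[cut:]
```

Returning new lists rather than mutating in place makes it straightforward to generate many neighbours from the same current solution in one iteration.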

The five neighbourhood operators 2-opt; 2-swap; move; crossover; and splitting
are assigned a hand-tuned weight of 1000; 1000; 3000; 4500; and 500 respectively,
which provides the relative probability that the operator will be applied.
We set the tabu tenure T to 30 for both phases. These values were determined
after some preliminary investigation.
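Since the five weights sum to 10,000, they correspond to selection probabilities of 0.10, 0.10, 0.30, 0.45 and 0.05. A minimal sketch of weighted operator selection with the standard library follows; the names are hypothetical, and renormalising the weights over a restricted operator set (as would be needed when only some operators are enabled) is an assumption of this example rather than something stated in the text.

```python
import random

OPERATOR_WEIGHTS = {
    "2-opt": 1000,
    "2-swap": 1000,
    "move": 3000,
    "crossover": 4500,
    "splitting": 500,
}

def selection_probabilities(allowed=None):
    """Normalise the weights of the allowed operators into probabilities."""
    names = [n for n in OPERATOR_WEIGHTS if allowed is None or n in allowed]
    total = sum(OPERATOR_WEIGHTS[n] for n in names)
    return {n: OPERATOR_WEIGHTS[n] / total for n in names}

def pick_operator(rng, allowed=None):
    """Draw one operator name with probability proportional to its weight."""
    probs = selection_probabilities(allowed)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
```

For instance, `pick_operator(random.Random(1))` returns one of the five names, with crossover the most likely outcome.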

Phase one uses only the 2-swap, move and crossover operators, and is invoked
only if the initial solution is infeasible. The following objective function obj is
used to capture the excess weight and length:

$$\mathit{obj} = \text{route len.} + \alpha \cdot \text{excess wt.} + \beta \cdot \text{excess len.} \quad (1)$$

$$\alpha = 20c/D \quad (2)$$

$$\beta = 20c/L \quad (3)$$

where $c$ is the average cost of the edges; $D$ is the capacity of a vehicle; and $L$ is
the length of a vehicle. The values of $\alpha$ and $\beta$ are increased if no progress is made
in 10 consecutive iterations; $\alpha$ will be increased by 50% if some route exceeds
the weight capacity $D$, while $\beta$ will be increased by 50% if some route requires a
vehicle with length greater than $L$. We sample 500 neighbours in each iteration,
and the best solution is selected as the current solution for the next iteration. Phase
one continues until a feasible solution is found, or 10,000 iterations are reached.
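The penalised objective of phase one and its coefficient update can be written compactly. The sketch below uses hypothetical function names and treats the excess weight and excess length as already computed totals; how those totals are aggregated over routes is not specified here and is left to the caller.

```python
def penalty_coefficients(avg_edge_cost, D, L):
    """Initial alpha = 20c/D and beta = 20c/L."""
    return 20 * avg_edge_cost / D, 20 * avg_edge_cost / L

def phase_one_objective(route_len, excess_wt, excess_len, alpha, beta):
    """Route length plus weighted penalties for excess weight and length."""
    return route_len + alpha * excess_wt + beta * excess_len

def escalate(alpha, beta, weight_violated, length_violated):
    """Raise each coefficient by 50% when its constraint is still violated.

    Intended to be called after 10 consecutive iterations without progress.
    """
    if weight_violated:
        alpha *= 1.5
    if length_violated:
        beta *= 1.5
    return alpha, beta
```

With an average edge cost of 10, D = 100 and L = 50, the initial coefficients are 2.0 and 4.0, so a solution of length 100 with 5 units of excess weight and 2 units of excess length scores 118.0.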

Phase two uses all five operators, and only feasible solutions are generated. In
each iteration, 500 neighbours are generated, and the one with minimum total
traveling cost is selected as the current solution for the next iteration. The best
solution found after 10,000 iterations is retained.


8  Computational  Experiments 

Our DBLF+MTA tabu search (DMTS) algorithm was coded in C++ using the
g++ compiler. It was tested on a Hewlett-Packard server with an Intel Xeon
E5430 2.66 GHz CPU, 8 GB RAM and running Linux (CentOS 5.1) using the
27 instances proposed by [1]. They can be broadly divided into three categories:
small instances 1 to 9 have 15-25 customers with 32-50 items; medium instances
10 to 18 have 29-44 customers with 62-94 items; and large instances 19 to 27
have 50-100 customers with 99-198 items.


Table 1. Performance of TS vs ACO vs DMTS

No      TS z  TS time(s)   ACO avg  ACO time(s)  DMTS min  DMTS avg  DMTS max  DMTS time(s)     Impr
1     316.32      129.50    305.35         11.2    301.71    301.77    302.02        193.05   -1.17%
2     350.58        5.30    331.90          0.1    334.96    334.96    334.96          9.32    0.00%
3     447.73      461.10    409.79         88.5    387.34    387.91    387.97         87.05   -5.34%
4     448.48      181.10    440.68          3.9    437.19    438.59    440.68         91.30   -0.47%
5     464.24       75.80    453.19         22.7    436.18    440.23    445.09        444.72   -2.86%
6     504.16     1167.90    501.47         17.5    498.32    499.48    501.05        125.98   -0.40%
7     831.66      181.10    797.47         51.4    707.46    771.09    772.87        394.90   -3.31%
8     871.77      156.10    820.67         56.2    803.98    805.95    807.75        331.02   -1.79%
9     666.10     1468.50    635.50         15.3    630.13    630.90    634.00        197.90   -0.72%
10    911.10      714.00    841.12        241.2    826.39    832.46    836.52        707.07   -1.03%
11    819.36      396.40    821.04        172.4    768.25    781.85    788.60        820.90   -4.58%
12    651.58      268.10    629.07         46.2    610.23    614.78    619.43        194.76   -2.27%
13   2928.34     1639.10   2739.80        235.4   2697.70   2715.82   2725.97        859.47   -0.88%
14   1559.64     3151.60   1472.26        623.8   1428.99   1456.13   1483.45       1638.78   -1.10%
15   1452.34     2327.40   1405.48        621.0   1352.94   1371.26   1382.08       1537.39   -2.43%
16    707.85     2550.30    698.92         12.8    698.61    699.54    703.35         46.55    0.09%
17    920.87     2142.50    870.33         11.8    871.63    875.19    877.72        731.74    0.50%
18   1400.52     1452.90   1261.07       2122.2   1227.07   1248.28   1276.74       1748.84   -1.01%
19    871.29     1822.30    781.29        614.3    762.47    776.35    795.72       1376.97   -0.63%
20    732.12      790.00    611.26       3762.3    583.45    593.17    606.28       1647.83   -2.96%
21   1275.20     2370.30   1124.55       5140.0   1094.78   1121.60   1150.11       1594.57   -0.26%
22   1277.94     1611.30   1197.43       2233.0   1170.89   1176.76   1194.43       1287.71   -1.73%
23   1258.16     6725.60   1171.77       3693.4   1137.90   1148.02   1161.95       1091.05   -2.03%
24   1307.09     6619.30   1148.70       1762.8   1132.05   1144.56   1157.18        469.80   -0.36%
25   1570.72     5630.90   1436.32       8619.7   1434.00   1457.09   1469.05       1582.82    1.45%
26   1847.95     4123.70   1616.99       6651.2   1606.85   1610.01   1632.61       1488.72   -0.21%
27   1747.52     7127.20   1573.50      10325.8   1551.68   1574.23   1600.80       1440.18    0.05%
Avg  1042.26     2058.80    966.66       1746.6    946.13    956.10    966.26        820.02   -1.31%

We compared DMTS against the reported results of tabu search (TS) by [1]
and the ant colony optimization (ACO) by [2]. The results for TS were obtained
on a Pentium IV 3 GHz PC with 512 MB RAM running Windows XP, and the
results for ACO were obtained on a Pentium IV 3.2 GHz with 2 GB RAM running
Linux. The CPU time limit for these two approaches was set to 1800 seconds for
small instances 1-9; 3600 seconds for medium instances 10-18; and 7200 seconds
for large instances 19-27. Since TS is deterministic, it was invoked once for each
instance. Both ACO and DMTS were invoked 10 times with different random
seeds on each run, and the average performance is reported.

The results are given in Table 1. The time(s) columns report the time taken
to produce the best solution for each algorithm. Column z is the cost of the
best solution found by TS. For ACO, the average total traveling cost over 10
executions is reported. We also report the minimum and maximum values for
the total traveling cost for DMTS; the last column, Impr, gives the percentage
improvement of the average of DMTS over the better result of TS and ACO;
negative values indicate an improvement, since the objective of this problem is
to minimize the overall traveling cost.
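
The Impr column can be reproduced from the other columns. The following is a minimal sketch of that arithmetic, assuming Impr is the relative change of the DMTS average with respect to the smaller of the TS cost and the ACO average; the function and argument names are illustrative and do not come from the paper.

```python
def improvement(dmts_avg, ts_cost, aco_avg):
    """Percentage change of the DMTS average relative to the better of TS and ACO."""
    best_other = min(ts_cost, aco_avg)  # lower traveling cost is better
    return 100.0 * (dmts_avg - best_other) / best_other


# Instance 1 of Table 1: TS z = 316.32, ACO avg = 305.35, DMTS avg = 301.77.
impr_1 = improvement(dmts_avg=301.77, ts_cost=316.32, aco_avg=305.35)
print(f"{impr_1:.2f}%")  # prints -1.17%
```

For instance 11, where TS (819.36) is better than ACO (821.04), the same function compares DMTS against the TS value instead.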

Our experiments show that DMTS outperforms both TS and ACO in all
instances except 2, 16, 17, 25, and 27. The average improvement is about 1.31%.
We also see that the running times for TS and ACO do not scale well, increasing
dramatically as the size of the instances increases. In contrast, DMTS maintains
a similar running time for both the large and medium instances. Furthermore,
DMTS converges very quickly; in the majority of cases, a high-quality
solution is found within the first 1000 iterations.

9  Conclusions 

In this study, we examined the Three-Dimensional Loading Capacitated Vehicle
Routing Problem. We extended two heuristics for the loading sub-problem,
namely Deepest-Bottom-Left-Fill and Maximum Touching Area; for the overall
algorithm, we employed a two-phase tabu search. Experiments showed that
our DMTS algorithm outperforms the best known algorithms in 22 out of 27
instances and is significantly faster for large cases. Furthermore, the algorithm
converges very quickly, so high-quality solutions are discovered even in the very
early stages of the search.

The  DBLF  and  MTA  heuristics  are  natural  and  logical  ways  to  solve  SVLP. 
They  are  sequence-based  approaches  that  mimic  the  loading  of  a  vehicle,  and 
can  effectively  address  the  various  constraints  as  each  item  is  loaded.  Although 
DMTS  can  be  refined  to  produce  better  solutions  (e.g.,  by  introducing  a  branch 
and  bound  post-optimization  step),  it  was  primarily  used  to  demonstrate  the 
effectiveness  of  these  heuristics,  which  can  potentially  be  adapted  to  any  problem 
that involves the SVLP with practical constraints.

Abstract. Manifold clustering finds wide application in many areas.
In this paper, we propose a new kernel function that makes use of Riemannian
geodesic distances among data points, and present a geometric
median shift algorithm over Riemannian manifolds. Relying on the geometric
median shift, together with geodesic distances, our approach
is able to effectively cluster data points distributed over Riemannian
manifolds. In addition to improving the clustering results, the complexity
of calculating the geometric median is reduced to $O(n^2)$, compared to
$O(n^2 \log n^2)$ for the Tukey median. Using both Riemannian manifolds and
Euclidean spaces, we compare the geometric median shift and mean shift
algorithms for clustering synthetic and real data sets.


1  Introduction 

Manifold learning has attracted increasing attention in recent years. It applies
to a wide range of problems, such as manifold clustering [4,6,7], for which methods
that operate in Euclidean spaces cannot achieve satisfactory results. In particular,
there are many works on mean-shift clustering in Euclidean space and on manifolds [8].
The mean shift is also widely applied in computer vision, for example in feature
analysis [1] and image segmentation [5]. Mean shift clustering is a non-parametric
clustering algorithm based on the nonparametric estimation of a probability
density function. The value of the density function at a point can be estimated
using the observed samples that fall within a small region around that point.
A shift window is used for density estimation. Points are assigned to the same
cluster when they converge to the same point in the mean shift iteration.
However, the point of convergence need not be an element of the data set.
Compared with the mean, the geometric median is robust to outliers. Moreover,
it is an actual element of the data set. As a result, the mean shift and the
geometric median shift choose different points to shift to, as shown in Figure 1.

In general, a median of a point set, such as the geometric median [3] or the
Tukey median [2], tends to lie in a high-density region of the data set. Similar
to mean shift, a median-shift algorithm in the Euclidean vector space has been
proposed [6].

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 268-279, 2010.

(c) Springer-Verlag Berlin Heidelberg 2010

Fig. 1. Comparison between geometric median shift and mean shift. The current point
is updated to the green-black point by geometric median shift, which is a true point
in the original data set, and to the blue point by mean shift, which is not.


Differing from the definition of the geometric median in [3], which allows a point
that does not exist in the data set, we define the geometric median as the true
point in the data set that has the least sum of squared distances to the other
points.

The geometric median, however, cannot describe the median of points on a manifold,
a situation that frequently occurs in non-vector spaces [9]. The distances between
pairs of points on a Riemannian manifold cannot be accurately calculated by
Euclidean distances, but rather by geodesic distances. One of the reasons for
this lies in the different underlying metric spaces of manifolds and Euclidean
space. The geodesic distance between two points is equal to either the length
of the lines connecting them in Euclidean space or their direct Euclidean distance,
depending on the shape and the metric of the manifolds II]. Using the
geometric median, we apply geometric median shift over Riemannian manifolds
to clustering in this paper. We make three contributions: i) we propose a new
kernel function that calculates Riemannian geodesic distances over Riemannian
manifolds; ii) we introduce the geometric median shift vector over Riemannian
manifolds; iii) we present a new geometric median shift algorithm over
Riemannian manifolds. Experiments are reported to demonstrate the performance
of our method on synthetic and real data sets. We also compare it with other
algorithms, including mean shift and median shift in Euclidean spaces.

2  Riemannian  Metric  in  Local  Coordinates

Define $X$ as a smooth manifold with a Riemannian metric $g$ on $X$. This means
that each point $p \in X$ defines a map $g_p(x, y) : T_pX \times T_pX \to \mathbb{R}$ [5],
where $\mathbb{R}$ is the set of real numbers, $T_pX$ denotes the tangent space of $X$
at point $p$, and $g_p(x, y)$ is a symmetric, positive definite, bilinear map. In
addition, we set $v_i$ and $v_j$ to be the basis of the tangent space $T_pX$ at point $p$,
as shown in Figure 2.

Fig. 2. Tangent space at the point $p$ on a manifold $X$, with basis vectors $v_1$ and $v_2$ in
the tangent space


Thus, in any local coordinate system, a metric is completely determined by the functions
$g_{ij}(p)$, which may be regarded as the coefficients of a positive definite matrix.


Moreover, the length of any piecewise smooth curve $c : [a, b] \to M$ with $c(a) = p$
and $c(b) = q$ is defined as:

$$\mathrm{length}[c] := \int_a^b \sqrt{g_{c(t)}\big(c'(t), c'(t)\big)}\, dt \tag{1}$$

where $c'(t)$ is the derivative of $c(t)$. On the basis of Eq. (1), for any points $p$ and $q$
on the manifold, let $C(p, q)$ denote the space of piecewise smooth curves $c : [a, b] \to M$
with $c(a) = p$ and $c(b) = q$. The distance between $p$ and $q$ on the manifold, denoted
by $d(p, q)$, is then

$$d(p, q) = \inf\{\mathrm{length}[c] \mid c \in C(p, q)\} \tag{2}$$


That is, the distance between a pair of points is defined as the greatest lower bound
of the lengths of the curves that connect those points. We implement the geodesic
distance function $d(x, y)$ between points $x$ and $y$ on a manifold using the methods
in [9] and [10].
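
As an illustration of Eq. (1), the curve length can be approximated numerically once the metric coefficients are known in local coordinates. The sketch below is only an example under stated assumptions: it uses the unit sphere in spherical coordinates $(\theta, \phi)$, whose metric is $\mathrm{diag}(1, \sin^2\theta)$, and a midpoint rule with finite-difference velocities. The function names are hypothetical, and this is not the geodesic computation of [9] or [10].

```python
import math


def sphere_metric(theta, phi):
    """Metric coefficients g_ij of the unit sphere in (theta, phi) coordinates."""
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]


def curve_length(curve, metric, a, b, steps=10000):
    """Approximate Eq. (1): integrate sqrt(g(c'(t), c'(t))) over [a, b]."""
    dt = (b - a) / steps
    total = 0.0
    for k in range(steps):
        p0 = curve(a + k * dt)
        p1 = curve(a + (k + 1) * dt)
        mid = [(u + v) / 2.0 for u, v in zip(p0, p1)]   # evaluate the metric at the midpoint
        vel = [(v - u) / dt for u, v in zip(p0, p1)]    # finite-difference estimate of c'(t)
        g = metric(*mid)
        speed_sq = sum(g[i][j] * vel[i] * vel[j] for i in range(2) for j in range(2))
        total += math.sqrt(speed_sq) * dt
    return total


def equator(t):
    """Curve along the equator: theta fixed at pi/2, phi = t."""
    return (math.pi / 2.0, t)


half_equator_length = curve_length(equator, sphere_metric, 0.0, math.pi)
```

Half of the equator has length $\pi$ on the unit sphere, which the approximation recovers. The distance in Eq. (2) would additionally require minimizing this length over all curves joining the two points.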

3  Geometric  Median-Shift  on  Riemannian  Manifolds 

In this section, we introduce the geometric median and compare it with the Tukey
median, which is the point with the largest Tukey depth [2] in the point set. The
Euclidean and geodesic distances are applied to the points in Euclidean space and
those on manifolds, respectively. In contrast to the mean used by mean shift on a
Riemannian manifold or in Euclidean space, the geometric median of points on a
manifold is indeed a true point. Besides, points distributed on a Riemannian manifold
are always discrete, so they cannot be handled directly by the mean-shift vector on
manifolds and Euclidean spaces. In this section, the mean is regarded as a virtual
point, as opposed to a median point. In other words, in a mean shift iteration the
space is regarded as continuous so that the gradient operator can be used. On this
basis, we extend each discrete point on a manifold to its neighborhood, and then
take the union of these neighborhoods, which is a union of open sets.

We start a geometric median shift procedure over Riemannian manifolds from
any point distributed on the manifold over the open continuous space. We shift
the point along the curves with the special direction. Using the definition of the
geometric median on a manifold, we propose a kernel density estimate with
profile $k$, bandwidth $h$ and Riemannian metric $d(x, y)$. The geometric median
has the minimum sum of squared geodesic distances to the other
points [3]. We define the kernel density estimate function for the geometric median
on Riemannian manifolds as


$$F_k(y) = \frac{c_{k,h}}{n} \sum_{i=1}^{n} k\!\left(\frac{d^2(y, x_i)}{h^2}\right) \varphi(y) \tag{3}$$


where $k(\cdot)$ is a flat kernel function with value 1 if $0 \le x \le 1$ and 0 otherwise,
and $\varphi(y)$ is another kernel function related to the sum of squared Riemannian geodesic
distances from point $y$ to the other points distributed on the manifold. The constant
$c_{k,h}$ and the kernel function $\varphi(y)$ are chosen to ensure that $F_k(y)$ is a convex
probability density function. We choose the sum of squared geodesic distances
because it is convenient to compute, especially for a large number of points on a
manifold, as opposed to the geodesic distance between points on Riemannian
manifolds or the Euclidean distance in Euclidean spaces.

Theorem 1. $F_k(y)$ is convex if $\varphi(y)$ is a convex function.

Proof. It has been proven in [3] that the squared geodesic distance to any $x_i$ is
convex. Since the kernel function $k(\cdot)$ is convex and $\varphi(y)$ is a convex function
by assumption, $F_k(y)$ can be taken as a product of convex
kernel functions. It is then obvious that $F_k(y)$ is a convex function. ■

Before defining the kernel function $\varphi(y)$ in Eq. (3), we compare
different kernel functions, as shown in Figure 3.

Based on Figure 3, we choose the Gaussian kernel function and define $\varphi(y)$ as

$$\varphi(y) = \sum_{i=1}^{n} \exp\big(-d^2(y, x_i)\big) \tag{4}$$

where $d(y, x_i)$ is the geodesic distance between data points $y$ and $x_i$
distributed on Riemannian manifolds. Combining Eq. (3) and Eq. (4), we obtain
the sum of squared geodesic distances. The final kernel function for the geometric
median over Riemannian manifolds is


$$F_k(y) = \frac{c_{k,h}}{n} \sum_{i=1}^{n} k\!\left(\frac{d^2(y, x_i)}{h^2}\right) \exp\big(-d^2(y, x_i)\big) \tag{5}$$

The gradient of Eq. (5) is

$$\nabla F_k(y) = \frac{2 c_{k,h}}{n} \sum_{i=1}^{n} \left( \frac{1}{h^2}\, g\!\left(\frac{d^2(y, x_i)}{h^2}\right) \exp\big(-d^2(y, x_i)\big) \log_y(x_i) + \exp\big(-d^2(y, x_i)\big) \log_y(x_i)\, k\!\left(\frac{d^2(y, x_i)}{h^2}\right) \right) \tag{6}$$

where $g(x) = -k'(x)$. Further, we denote $\psi(y, x_i)$ as

$$\psi(y, x_i) = \frac{1}{h^2}\, g\!\left(\frac{d^2(y, x_i)}{h^2}\right) \exp\big(-d^2(y, x_i)\big) + \exp\big(-d^2(y, x_i)\big)\, k\!\left(\frac{d^2(y, x_i)}{h^2}\right) \tag{7}$$

Fig. 3. Comparison of different kernel functions. The triangle kernel function is not
differentiable at the point (0,1). The quartic kernel function is not a strictly monotonically
decreasing function. The triweight kernel function does not change significantly when
the x-coordinate is close to 1 or -1. The Epanechnikov kernel function is negative.
The Gaussian kernel function is non-negative, differentiable,
monotonically decreasing and convex.


On the basis of Eqs. (5), (6) and (7), we give the geometric median shift vector
in tangent spaces and on Riemannian manifolds in Eq. (8) and Eq. (9),
respectively. The functions $\mathrm{Manifolds}_x(y)$ and $\log_x(y)$ are
computed by the methods in [9] and [10], respectively.

$$M_{h\text{-tangent}}(y) = \frac{\sum_{i=1}^{n} \psi(y, x_i) \log_y(x_i)}{\sum_{i=1}^{n} \psi(y, x_i)} \tag{8}$$

$$M_{h\text{-manifold}}(y) = \mathrm{Manifolds}_y\big(\alpha M_{h\text{-tangent}}(y)\big) \tag{9}$$

Figure  4  illustrates  the  meanings  of  notations  used  in  the  above  equations. 


Fig. 4. The function $\log_x(y)$ is the vector that lies in the tangent space $T_xM$ at point
$x$. The set $\mathrm{Manifolds}_x(y)$ includes the points on the Riemannian manifold along the
curve that starts from point $x$ and whose tangent vector at $x$ is $y$ in the tangent
space $T_xM$.

Similar to the step size used in [3], we set $0 < \alpha < 2$ to ensure that the point
on the Riemannian manifold iterates to a convergence point. Specifically, we adopt
$\alpha = 1$ in our experiments. The final stable point, which is the point in the original
point set that minimizes the sum of squared geodesic distances to the other points on
the Riemannian manifold, is calculated by Eq. (9): it is iteratively shifted from some
starting point in the point set on the Riemannian manifold.
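
To make the roles of $\log_x(y)$ and $\mathrm{Manifolds}_x(y)$ concrete, the following sketch works on the unit sphere, where both maps have closed forms. This choice of manifold is an assumption made for illustration; the paper computes these maps with the methods in [9] and [10]. With the flat kernel, $g = -k'$ vanishes inside the window, so the weight $\psi$ is taken here as $\exp(-d^2)$ for points with $d^2 \le h^2$ and 0 otherwise. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sphere_log(x, y):
    """Tangent vector at x pointing toward y; its norm is the geodesic distance."""
    c = max(-1.0, min(1.0, dot(x, y)))
    theta = math.acos(c)                      # geodesic distance on the unit sphere
    w = [yi - c * xi for xi, yi in zip(x, y)]  # component of y orthogonal to x
    norm = math.sqrt(dot(w, w))
    if norm < 1e-12:                          # y coincides with x (antipodes are not handled)
        return [0.0] * len(x)
    return [theta * wi / norm for wi in w]


def sphere_exp(x, v):
    """Point reached from x by following the geodesic with initial tangent vector v."""
    norm = math.sqrt(dot(v, v))
    if norm < 1e-12:
        return list(x)
    return [math.cos(norm) * xi + math.sin(norm) * vi / norm for xi, vi in zip(x, v)]


def shift_step(y, points, h, alpha=1.0):
    """One update y <- Manifolds_y(alpha * M_h-tangent(y)), following Eqs. (8) and (9)."""
    num = [0.0] * len(y)
    den = 0.0
    for x in points:
        v = sphere_log(y, x)
        d2 = dot(v, v)                        # squared geodesic distance d^2(y, x)
        if d2 <= h * h:                       # flat kernel window
            w = math.exp(-d2)
            den += w
            num = [n + w * vi for n, vi in zip(num, v)]
    if den == 0.0:
        return list(y)
    return sphere_exp(y, [alpha * n / den for n in num])
```

For example, with two data points placed symmetrically about $y$ the tangent vectors cancel and `shift_step` leaves $y$ unchanged, while with a single data point inside the window and $\alpha = 1$ the update moves $y$ onto that point.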

Theorem 2. The sequence $\{M_{h\text{-manifold}}(y_k)\}_{k=1,2,\ldots,n}$ generated by successive
geometric median shift over Riemannian manifolds converges for all starting
locations in the point set $\{x_i\}_{i=1,\ldots,n}$.

Proof. Since $M_{h\text{-manifold}}(y_k)$ and $N$ are finite, the series will converge if there
are no cycles, i.e. if $M_{h\text{-manifold}}(y_k) \ne M_{h\text{-manifold}}(y_{k+w})$ for all $k$ and all
$w > 0$. According to Theorem 1, $F_k(y)$ is a convex function. This is because
$\exp(-d^2(y, x_i))$ is a convex function. We then have

$$F_k(y_{k+1}) - F_k(y_k) > \nabla F_k(y_k)(y_{k+1} - y_k). \tag{10}$$

The geometric median $y$ is $y = x_{i^*}$, where

$$i^* = \arg\min \sum_{i=1}^{n} d^2(y, x_i) \tag{11}$$

Therefore we have

$$F_k(y_{k+1}) - F_k(y_k) > \sum_{i=1}^{n} \nabla F_k\big(d^2(y, x_i)\big)\big(d^2(y_{k+1}, x_i) - d^2(y_k, x_i)\big) \tag{12}$$

If $y_{k+1} = \mathrm{Manifolds}_y(\alpha M_{h\text{-tangent}}(y_k))$, then we have

$$\sum_{i=1}^{n} d^2(y_{k+1}, x_i) > \sum_{i=1}^{n} d^2(y_k, x_i), \tag{13}$$

Eq. (13) can be rewritten as

$$\sum_{i=1}^{n} \big(d^2(y_{k+1}, x_i) - d^2(y_k, x_i)\big) > 0. \tag{14}$$

From the inequalities Eq. (12) and Eq. (14), we can deduce that the inequality
$F_k(y_{k+1}) > F_k(y_k)$ holds for the sequence $\{y_0, y_1, \ldots, y_k\}$ generated from
Eq. (9). The corresponding sequence $\{F_k(y_0), F_k(y_1), \ldots, F_k(y_k)\}$ is
strictly increasing. This leads to $F_k(y_{k+w}) > F_k(y_k)$ for all $w > 0$, and therefore
we have $y_{k+w} \ne y_k$. ■

Based on the above theorems, the geometric median shift algorithm over a
Riemannian manifold is given in Algorithm 1.

Algorithm 1. Geometric median shift over Riemannian manifolds
Require: Input: points on Riemannian manifolds $x_i$ ($i = 1, \ldots, n$), $\varepsilon$,
bandwidth $h$ and $\alpha$
Output: distinct local modes

Extend all the points to the union of all their neighborhoods to form a smooth and
continuous point space, set $k = 1$, and check that $0 < \alpha < 2$
Ensure:
for $i \leftarrow 1, \ldots, n$
    $y \leftarrow x_i$
    do
        $M_{h\text{-tangent}}(y) \leftarrow \sum_{x_j \text{ within a window}} \psi(y, x_j) \log_y(x_j) \,/\, \sum_{x_j \text{ within a window}} \psi(y, x_j)$   (Eq. (8))
        $y \leftarrow \mathrm{Manifolds}_y(\alpha M_{h\text{-tangent}}(y))$   (Eq. (9))
    until $\| M_{h\text{-tangent}}(y) \| < \varepsilon$
    $y_k = x_{i^*}$, where $i^* = \arg\min_{y \in \{x_j \text{ within a window}\}} \sum_i d^2(y, x_i)$   (Eq. (11))
    Retain $y_k$ as a local mode
    $k \leftarrow k + 1$
end for
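
The loop of Algorithm 1 can be sketched in a few lines for the special case of Euclidean space, where $\log_y(x) = x - y$ and $\mathrm{Manifolds}_y(v) = y + v$. This simplification, the flat-kernel weight $\exp(-d^2)$ inside the window, the iteration cap `max_iter`, and all function names are assumptions made for illustration; the sketch is not the authors' C++/Matlab implementation.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def find_mode(start, points, h, alpha=1.0, eps=1e-6, max_iter=500):
    """Shift from `start` until the shift vector is shorter than eps, then snap to a data point."""
    y = list(start)
    for _ in range(max_iter):
        # Flat kernel: only points with d^2 <= h^2 contribute, each with weight exp(-d^2).
        window = [x for x in points if sq_dist(y, x) <= h * h]
        if not window:
            break
        weights = [math.exp(-sq_dist(y, x)) for x in window]
        total = sum(weights)
        # Eq. (8) with log_y(x) = x - y.
        shift = [sum(w * (x[d] - y[d]) for w, x in zip(weights, window)) / total
                 for d in range(len(y))]
        # Eq. (9) with Manifolds_y(v) = y + v.
        y = [yc + alpha * sc for yc, sc in zip(y, shift)]
        if math.sqrt(sum(s * s for s in shift)) < eps:
            break
    # Eq. (11): keep the data point in the window with the least sum of squared distances.
    window = [x for x in points if sq_dist(y, x) <= h * h] or list(points)
    return min(window, key=lambda c: sum(sq_dist(c, x) for x in window))


def geometric_median_shift(points, h, alpha=1.0):
    """Return the set of distinct local modes; points must be tuples."""
    return {find_mode(p, points, h, alpha) for p in points}


cluster_a = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (-0.2, 0.0), (0.0, -0.2)]
cluster_b = [(x + 5.0, y + 5.0) for x, y in cluster_a]
modes = geometric_median_shift(cluster_a + cluster_b, h=1.0)
```

On this toy input every returned mode is one of the input points, which is the property that distinguishes the geometric median shift from the mean shift in the text.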


Theorem 3. The time complexity of our algorithm for finding the geometric median
of data points is $O(n^2)$, where $n$ is the number of data points in the set.

Proof. The algorithm consists of two main steps.

Step 1: Calculating the sum of distances to the other $n - 1$ points takes
$O(n - 1)$ for each point. Therefore, it takes $O(n(n - 1))$ to obtain the sums of
distances for all points in the data set.

Step 2: Selecting the point that has the minimum sum of distances to the others
as the geometric median of the data set takes $O(n)$. The total
time complexity is $O(n) + O(n(n - 1)) = O(n^2)$. ■
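
The two steps of the proof correspond directly to the following sketch, which assumes the pairwise distances (geodesic or Euclidean) have already been computed and stored in a matrix `dist`; the function name and the example data are illustrative.

```python
def geometric_median_index(dist):
    """Index of the point with the minimum sum of distances to all other points."""
    n = len(dist)
    # Step 1: n sums of n - 1 distances each, i.e. n * (n - 1) additions.
    sums = [sum(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    # Step 2: a single O(n) pass to select the minimum.
    return min(range(n), key=sums.__getitem__)


# Five points on a line; distances are absolute differences.
coords = [0.0, 1.0, 2.0, 4.0, 10.0]
dist = [[abs(a - b) for b in coords] for a in coords]
median_index = geometric_median_index(dist)
```

Here the sums of distances are 17, 14, 13, 15 and 33, so the point at coordinate 2.0 (index 2) is selected.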


4  Experiments 

We have implemented the geometric median shift algorithm on manifolds in
C++ and Matlab. A number of experiments are performed to evaluate the
performance of our algorithm. Note that the orientation of curves on a manifold
is not considered in the experiments.

4.1  Geometric  Median  Shift  on  Synthetic  Datasets 

In Figure 5, we compare our method with mean shift over Euclidean space,
over Riemannian manifolds, and over both Euclidean space and Riemannian
manifolds, respectively.

As expected, the resulting clusters vary in size, shape and so on, as shown
in Figure 6. Only our proposed algorithm is able to find the correct clusters.

Fig. 5. 800 points distributed on the neighborhood of 6 peaks of a surface

Fig. 6. Comparison of geometric median shift and mean shift clustering on points
over a manifold and in Euclidean space. (a) 6 resulting clusters are found, because
the geodesic distance and a geometric median that is a true point are employed.
(b) and (d) show the comparison between the processes of applying the geometric
median shift to points in Euclidean space. The median and the mean of the points
represent a true point in the original data set and a non-existing point, respectively.
The points on the manifold and in Euclidean space are measured by the geodesic
distance and the Euclidean distance, respectively. These two facts lead to different
clustering results for the geometric median shift and the mean shift over the
Riemannian manifold and Euclidean space. (c) 3 clusters are formed, because the
mean of the points can be a virtual point in the process corresponding to
geometric median shift.


4.2  Geometric  Median  Shift  on  Real  Datasets 

We applied our algorithm to cluster the data points of 4 swiss-roll type data
sets [6], each Swiss roll having about 500 data points distributed on a manifold,
as shown in Figure 7. The clustering results of the mean shift algorithm with
Euclidean distance are not as good as those of the geometric median shift over a
Riemannian manifold with the Riemannian geodesic distance in [10].

Furthermore, we compare the geometric median shift algorithm with the mean
shift using different $h$ values in Eq. (3) on the 4 swiss roll data sets. The
validation metric chosen for evaluating the clustering results is the average
Euclidean distance to each cluster center: the smaller the average Euclidean
distance, the better the clustering result. As expected, the average
distance between the center and the members of each cluster formed by our
method is smaller than that formed by mean shift. This is because the geodesic
distance characterizes the data distribution better than the Euclidean distance
for the swiss roll data sets. The two sets of results are reported in Tables 1 and 2,
respectively. Note that outliers are excluded when computing the distances to cluster
centers in the following experiments.

Fig. 7. Clusters produced by mean shift (left) in Euclidean space
and by geometric median shift (right) over Riemannian manifolds using the Riemannian
geodesic distances in [10], with window size 1 x 1
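
The validation metric described above (average Euclidean distance to each cluster center, summarized by its minimum, maximum and average over clusters) can be computed as in the following sketch. The data layout, with one label per point and a dictionary of centers, and the function name are assumptions made for illustration.

```python
import math


def average_distance_to_centers(points, labels, centers):
    """Map each cluster label to the mean Euclidean distance of its members to its center."""
    totals, counts = {}, {}
    for p, lab in zip(points, labels):
        totals[lab] = totals.get(lab, 0.0) + math.dist(p, centers[lab])
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: totals[lab] / counts[lab] for lab in totals}


# Toy example with two clusters of two points each.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
centers = {0: (1.0, 0.0), 1: (10.0, 2.0)}
per_cluster = average_distance_to_centers(points, labels, centers)
summary = (min(per_cluster.values()),
           max(per_cluster.values()),
           sum(per_cluster.values()) / len(per_cluster))
```

The three entries of `summary` play the role of the minimum, maximum and average columns of the tables that follow; excluding outliers, as the text notes, would amount to dropping those points before the call.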


Table 1. Clustering results of geometric median shift using Riemannian geodesic
distance in [10] tested on 4 swiss rolls in [6]

                              Average distance to each cluster center
h    window size   #cluster   minimum   maximum   average
2    2 x 2         4          3.6582    3.8344    3.7214
3    3 x 3         4          3.6576    3.9982    3.7152
4    4 x 4         4          3.6827    3.8991    3.7921
5    5 x 5         4          3.0119    4.0017    3.9133
6    6 x 6         4          3.6772    3.9982    3.7287
7    7 x 7         4          3.6225    3.8256    3.7821
8    8 x 8         4          3.6776    3.7815    3.6988
9    9 x 9         4          3.6852    3.7881    3.6879
10   10 x 10       4          3.6684    3.7751    3.6693

Table 2. Clustering results of mean shift using Euclidean distance tested on 4 swiss
rolls in [6]

                              Average distance to each cluster center
h    window size   #cluster   minimum   maximum   average
2    2 x 2         2          4.0582    4.5844    4.3213
3    3 x 3         1          4.1265    4.1265    4.1265
4    4 x 4         1          4.1394    4.1394    4.1394
5    5 x 5         1          4.1442    4.1442    4.1442
6    6 x 6         1          4.1492    4.1492    4.1492
7    7 x 7         1          4.1699    4.1699    4.1699
8    8 x 8         1          4.1742    4.1742    4.1742
9    9 x 9         1          4.1764    4.1764    4.1764
10   10 x 10       1          4.1791    4.1791    4.1791

Fig. 8. The results of visualizing the five crescents data sets in [6]


Fig. 9. Cluster number of three different algorithms tested on 5 crescents data sets


Fig. 10. Average sum-of-distance for each cluster of geometric median shift via
Riemannian geodesic distance (a), mean shift via Riemannian geodesic distance (b) and
mean shift via Euclidean distance (c)


Using the geometric median shift and mean shift algorithms over Riemannian
manifolds with the Riemannian manifold distance in [10], as well as mean shift with
Euclidean distances, we tested the five crescents data sets [6] (see Figure 8), which
contain 5053 data points, with different values of the parameter $h$ in Eq. (3). The
clustering results are listed in terms of both the average distance to each
cluster center and the running time. We also calculated the average
sum-of-distance for each cluster and the running time with different values of the
parameter $h$ in Eq. (3), as shown in Figures 10 and 11, respectively.

Fig. 11. Running time of three different algorithms tested on 5 crescents data sets

4.3  Results  Analysis 

As shown in Figure 1, geometric median shift always moves to an existing
point, rather than to a non-existing point as in mean shift. The larger the value of
Eq. (5), the more likely it is to converge to the current existing point. This happens
when there are more points around the current point and the sum of squared
distances from the current point to all others within a window is smaller. This
differs from the mean shift, in which the mean is defined as the center of gravity
of the points. The mean may therefore not be a true point, so the
algorithm continues to converge. For this reason, the mean shift always
produces fewer clusters than geometric median shift does, as in the examples
shown in Figure 9. Mean shift commonly groups points even at relatively large
geodesic distances by passing through non-existing points, so its average distance
to cluster centers is larger than that of geometric median shift. This has been
validated by the results reported in Figure 10. For all these
reasons, in terms of average sum-of-distances, geometric median shift produces
better clustering results than the mean shift algorithm, particularly
for data sets with implicit manifolds. It takes $O(n)$ time to obtain the mean of
the points, while the calculation of the geometric median takes $O(n^2)$. Because of
this, the running time of the mean shift algorithm is less than that of geometric
median shift, both with manifold distances and with Euclidean distances. In general,
calculating the geodesic distance between two data points takes more time than
calculating the Euclidean distance. So the geometric median shift over Riemannian
manifolds takes more time than the others, which leads to the results in Figure 11.
Further, if the shifting window size is increased, the number of iterations decreases,
and the running time is reduced accordingly. In addition, increasing the size of the
shift window makes more outliers become members of clusters. This makes the
average sum-of-distance larger, as shown in Figure 10.
This holds for both the geometric median shift and mean shift algorithms.

5  Conclusion 

Manifold clustering has attracted increasing attention in recent years. In this
paper, we have presented a geometric median shift algorithm for clustering
data points on Riemannian manifolds. Given two data points, their Riemannian
geodesic distance may not equal their corresponding Euclidean distance. This
fact may lead to different clustering results when the mean shift is used.
From the experiments, we conclude that the clustering results obtained by using the
true median point on a manifold are more accurate than those obtained by the mean
shift in Euclidean space. Furthermore, compared to using the Tukey median, our
algorithm for calculating the geometric median reduces the complexity from
$O(n^2 \log n^2)$ to $O(n^2)$. Applying the proposed approach to more applications is
our future work.

Abstract. Manifold clustering, which regards clusters as groups of points around
compact manifolds, has been recognized as a promising generalization of traditional
clustering. A number of linear or nonlinear manifold clustering approaches have
been developed recently. Although they have attained better performance than
traditional clustering methods in many scenarios, most of these approaches suffer
from two weaknesses. First, when the data are drawn from hybrid modeling, i.e.,
some data manifolds are separated but some are intersected, existing approaches
do not work well, although hybrid modeling often appears in real data. Second,
many approaches require the number of clusters and the intrinsic
dimensions of the manifolds to be known in advance, while it is hard for the user
to provide such information in practice. In this paper, we propose a new manifold
clustering approach, mumCluster, to address these issues. Experimental results show
that the performance of the proposed mumCluster approach is encouraging.


1  Introduction 

Traditional clustering methods, such as K-means [1], are based on the idea that a cluster
is centered around a single point when measuring similarity. Recently, a large number
of research efforts have shown that the perceptually meaningful structure of the points
possibly resides on a low-dimensional manifold [2,3]. Therefore, regarding a cluster as
a group of points around a compact manifold becomes a reasonable and promising
generalization of traditional clustering, leading to manifold clustering [4].

Roughly speaking, the research on manifold clustering can be classified into two
branches, i.e., linear and nonlinear. Generalized Principal Component Analysis (GPCA)
[5,6] and K-planes [7,8,9] assume the samples to be well approximated by a mixture of
affine subspaces (or linear manifolds). However, manifolds in natural data are generally
nonlinear in the original space [2]. Spectral clustering (SC) [10,11] is a good option
when the samples lie on separated clusters, where each cluster contains points
sampled from a single nonlinear manifold. Alternatively, Cao and Haralick [12] use
the local dimension and mean square error to infer clusters. However, when there are
intersections among clusters, their performance degenerates. K-manifolds [4] is
primarily motivated by clustering samples generated from intersecting nonlinear
manifolds, and it fails when the clusters are widely separated.

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 280-291, 2010.

Fig. 1. Data points drawn from a hybrid modeling


There are two main difficulties for existing methods. On the one hand, they usually
work well either in the separated case or in the intersecting case, but not in both.
When the input data points are drawn from a hybrid modeling (see Figure 1), where
some manifolds are separated while others intersect each other, the quality of
clustering degenerates. On the other hand, many existing methods require the user to
provide the number of clusters and their intrinsic dimensions in advance, while such
information is difficult to give in practice. For example, considering a data set
consisting of face images of different individuals under various lighting conditions,
it is difficult for the user to know ahead of time whether the underlying manifolds
are separated or intersected, as well as the number of clusters and the intrinsic
dimensions. Thus, to enable manifold clustering to deal with more real tasks, it is
important to design manifold clustering approaches which work well when the samples
are drawn from hybrid modeling, and which can adaptively determine the number of
clusters and dimensions.

In this paper, we propose a new manifold clustering method called mumCluster
(MUlti-Manifold Clustering). Our basic idea is based on the observation that if we can
make the undirected graph constructed in spectral clustering more faithful, i.e., data
points belonging to different manifolds are not connected, then spectral clustering
can be used to identify different manifolds accurately. Thus, our scheme first identifies
the separate subsets of the original data, and then determines whether a subset is
composed of a single manifold or intersecting manifolds. For each intersecting subset,
we exclude the influence of the inaccurate connections among different manifolds.
Finally, spectral clustering is used to further infer clusters. Moreover, a
strategy is developed to automatically determine the number of manifold clusters and
their corresponding dimensions.

The rest of this paper is organized as follows. Section 2 briefly reviews the related
manifold clustering methods. In Section 3, the mumCluster method is presented, followed
by a strategy to determine the number of clusters and their dimensions. Computational
complexity analysis of the proposed method is also presented in this section. In
Section 4, we experimentally evaluate the performance of our proposed method using
synthetic and real-world data. Section 5 concludes this paper.

2  Related  Work 

Cluster analysis [13] seeks to group internally similar objects into the same cluster
and dissimilar objects into different clusters. Traditional clustering methods, such as
K-means [1], assume the data are centered around some prototypes. They cannot
separate clusters that are nonlinearly separable or centered around manifolds.

GPCA [5,6] and K-planes [7,8,9] are representative linear manifold clustering methods.
GPCA models the underlying manifolds with a set of homogeneous polynomials,
and the constructed models are then used to infer clusters. Alternatively, K-planes
addresses linear manifold clustering by iterating between assigning data to manifolds
and fitting a manifold to each cluster. Although successful for mixtures of linear
clusters, both of them fail to deliver good performance in the presence of nonlinear
structures (e.g., Figure 3 (a) and (b)). Since nonlinear methods can also work well on
linear clusters, in this paper we focus on nonlinear manifold clustering.

Spectral clustering [10,11] is a good option for nonlinear manifold clustering when
samples are generated from separated clusters where each cluster contains data points
from a single manifold [14]. However, when there are intersections in some areas,
spectral clustering does not work well (e.g., Figure 3 (c)). The reason is that the
performance of spectral clustering relies heavily on the constructed undirected graph:
different clusters near a manifold intersection are easily connected by the undirected
graph, thus diffusing information across the wrong manifolds [15]. K-manifolds
[4] groups data lying on intersecting nonlinear manifolds. It begins by estimating
geodesic distances between points, and then uses an expectation maximization (EM) type
strategy to iterate between estimating the manifolds using node-weighted MDS
and assigning each point to the specified manifolds. Unfortunately, the estimation of
geodesic distances fails when there are separated clusters, leading to incorrect
clustering (e.g., Figure 3 (d)). The method most related to ours was proposed by Cao and
Haralick [12], which groups neighboring points into a cluster if they have the same
local dimension and the mean square error of representing the new cluster is small.
This method can handle hybrid modeling to some extent, by using graph methods
to identify different connected components. However, it is primarily based on the local
dimension, so it usually treats the intersections as clusters, since the local
dimension in the intersections is higher than in the other areas (e.g., Figure 3 (e)).

3  MumCluster 

We are given a set of data points $X = \{x_i \in \mathbb{R}^D, i = 1, 2, \ldots, N\}$ sampled from $k > 1$
distinct manifolds $\{\Omega_j \subset \mathbb{R}^D, j = 1, 2, \ldots, k\}$ with dimension $d_j = \dim(\Omega_j)$,
$0 < d_j < D$. The samples are unorganized, i.e., we do not know which points belong to
which manifold. Moreover, some manifolds intersect each other, forming
intersecting manifolds. Our objectives are:

1. Identify the number of manifolds $k$ and their intrinsic dimensions $\{d_j, j = 1, 2, \ldots, k\}$;
2.  Partition  the  given  samples  into  the  manifold(s)  they  belong  to. 

Though a considerable amount of work has been done in this field, as we have reviewed
above, existing methods do not work well on hybrid modeling. Moreover, many of them
need the user to specify $k$ and $\{d_j, j = 1, 2, \ldots, k\}$. In what follows, we propose the
mumCluster method to address these issues.

Our main strategy is to construct a more faithful undirected graph for spectral
clustering, i.e., one in which data points belonging to different manifolds are not
connected. To this end, mumCluster adopts a "divide and conquer" strategy.
This scheme first separates the complicated intersecting manifolds from the single
manifolds; then each intersecting subset is further divided into intersection areas and
non-intersection areas. More attention is paid to the intersection areas, where many of
the inaccurate connections are situated. The details of the method are presented
in Subsection 3.1, followed by a strategy to automatically determine the number of
clusters and their dimensions in Subsection 3.2. Complexity analysis is presented in
Subsection 3.3.

3.1  To  Deal  with  Hybrid  Modeling 

Generally, hybrid modeling data can be divided into different connected subsets, with
some subsets containing only a single manifold and the others containing intersecting
manifolds. To deal with the two different structures separately, we propose to use
spectral clustering to partition the samples coarsely into different connected subsets.
There are different versions of spectral clustering. Following von Luxburg's suggestion
[14], the following unsymmetrical normalized spectral clustering [10] is adopted:

1. Constructing a similarity graph $G$: Put an edge between nodes $i$ and $j$ if $i$ is among
the $L$ nearest neighbors of $j$, and vice versa.

2. Determining the weighted matrix $W$: If nodes $i$ and $j$ are connected, then set the
weight $w_{ij} = 1$ (simple weight); otherwise, set $w_{ij} = 0$.

3. Spectral decomposition: Compute the first $r$ eigenvectors $u_1, u_2, \ldots, u_r$,
corresponding to the $r$ smallest eigenvalues, of the generalized eigenproblem $Eu = \lambda Fu$,
where $F$ is a diagonal matrix with $F_{ii} = \sum_j w_{ij}$ and $E = F - W$. Let $U = [u_1, u_2, \ldots, u_r]$.

4. Clustering by K-means: Group the points $\{y_i, i = 1, 2, \ldots, N\}$ into $r$ clusters using
K-means, where $y_i$ is the vector corresponding to the $i$-th row of $U$.

In the above procedure, $r$ must be provided. We discuss how to decide $r$ in
the next subsection.
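The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `knn_graph` and `spectral_embedding` are hypothetical, "and vice versa" is read here as connecting two nodes when either is among the other's L nearest neighbors, and the generalized eigenproblem is solved through the equivalent symmetric matrix F^{-1/2} E F^{-1/2}. The final K-means step is left to any standard implementation applied to the rows of U.

```python
import numpy as np

def knn_graph(X, L):
    """Return the 0/1 weight matrix W of a symmetric L-nearest-neighbor graph.

    X has shape (N, D), one sample per row. Assumption: an edge is placed
    when either endpoint is among the other's L nearest neighbors.
    """
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    np.fill_diagonal(d2, np.inf)                        # exclude self-matches
    nearest = np.argsort(d2, axis=1)[:, :L]
    A = np.zeros(d2.shape, dtype=bool)
    A[np.arange(len(X))[:, None], nearest] = True
    return (A | A.T).astype(float)

def spectral_embedding(W, r):
    """Solve E u = lambda F u with F = diag(row sums of W) and E = F - W.

    Returns the r smallest eigenvalues and the matrix U whose columns are
    the corresponding generalized eigenvectors.
    """
    deg = W.sum(axis=1)                                 # every node has degree >= L
    inv_sqrt = 1.0 / np.sqrt(deg)
    E = np.diag(deg) - W
    M = inv_sqrt[:, None] * E * inv_sqrt[None, :]       # F^{-1/2} E F^{-1/2}
    lam, V = np.linalg.eigh(M)                          # eigenvalues in ascending order
    U = inv_sqrt[:, None] * V[:, :r]                    # map back: u = F^{-1/2} v
    return lam[:r], U
```

For data made of well-separated groups, the rows of U are constant within each connected component, which is why K-means on those rows recovers the components.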

After the different connected subsets $\Omega^{c}$, $c = 1, \ldots, r$, have been identified, the
problem is how to determine their structure, i.e., single or intersecting. For this
purpose, our basic idea is to resort to the intrinsic dimension $id$. It is based on the
observation that if samples come from a single manifold, then the intrinsic dimension of
each point on this manifold should be the same; otherwise, the dimensions differ.
Details on estimating $id$ are presented in the next subsection.

If the connected subset consists of a single manifold, then a manifold cluster has been
revealed. However, for each intersecting subset $\Omega^{is}$, further procedures are needed to
reveal the different manifold clusters. The first is to identify the intersection areas
$\Omega^{ia}$ and the non-intersection areas $\Omega^{nia}$. Generally, the points in $\Omega^{ia}$ have higher
dimension than the other parts. Therefore, the points with the highest dimension $d_{\max}$
should first be grouped into $\Omega^{ia}$. In practice, the structure in the intersection area is
usually complex. To ensure that this area is identified accurately, the $\varepsilon$-neighbors
can be used. That is,


(1)


where $x_{ip}$ is any point with dimension $d_{\max}$. Finally, $\Omega^{is}$ is divided into $\Omega^{ia}$ and $\Omega^{nia}$.
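The ε-neighbor step can be illustrated as follows. Because the exact form of Eq. (1) is not reproduced in this text, the sketch assumes one plausible reading: a point belongs to the intersection area if it lies within Euclidean distance ε of any point whose estimated dimension equals d_max. The function name `intersection_mask` and this reading are assumptions made for illustration only.

```python
import numpy as np

def intersection_mask(X, dims, eps):
    """Boolean mask of points assumed to lie in the intersection area.

    X: (N, D) samples of one intersecting subset.
    dims: per-point intrinsic dimension estimates.
    eps: radius of the epsilon-neighborhood (assumed Euclidean).
    """
    X = np.asarray(X, dtype=float)
    dims = np.asarray(dims)
    seeds = X[dims == dims.max()]                       # points with dimension d_max
    d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
    return (d2 <= eps ** 2).any(axis=1)                 # within eps of any seed
```

The complement of the returned mask then plays the role of the non-intersection area.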
The points in $\Omega^{ia}$ and $\Omega^{nia}$ may consist of many small clusters (called intersection
clusters and non-intersection clusters, respectively), which should be grouped in order
to tackle them separately. Generally, these clusters are unconnected, so spectral
clustering can again be used to group them. If the dimensions on some non-intersection
clusters are different, it implies that there may still exist other intersection clusters
with lower $d_{\max}$. Therefore, we go back and identify these areas until no
hidden intersection remains.

The intersection area implies that different manifolds pass across each
other, and these should be revealed. Though the manifold clusters are nonlinear, each
intersection cluster can be considered as a mixture of manifolds with linear structure
since it is a local area. Thus, K-planes can be adopted to reveal the different manifolds
(named fine clusters) in each intersection cluster. Specifically, given the number of
clusters $k^*$ and the dimensions $d_1^*, \ldots, d_{k^*}^*$:

1. Initialization: Assign each point to a cluster randomly to give an initial partition
$\{C_1^*, C_2^*, \ldots, C_{k^*}^*\}$. Then alternate between the following two steps until
convergence.

2. Cluster update: Find a center $\mu_i^*$ and a set of bases $\Phi_i = [\phi_{i1}, \phi_{i2}, \ldots, \phi_{id_i^*}]$ for
cluster $C_i^*$ such that the reconstruction error is minimum.

3. Cluster assignment: For each point $x_m$ in the considered intersection cluster,
determine the space $j$ such that

$$(x_m - \mu_j^*)^T (I - \Phi_j \Phi_j^T)(x_m - \mu_j^*) = \min_{i = 1, \ldots, k^*} (x_m - \mu_i^*)^T (I - \Phi_i \Phi_i^T)(x_m - \mu_i^*), \qquad (2)$$

where $I$ is an identity matrix. Then, $x_m$ is assigned to the $j$-th cluster $C_j^*$.
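The alternation between the cluster update and the cluster assignment can be sketched as below. This is a minimal K-planes illustration, not the authors' code: the names `fit_plane`, `plane_residuals` and `k_planes` are hypothetical, each plane is fitted by PCA (via SVD) of the cluster members, and an optional `labels` argument is an added convenience so that the initial partition can be fixed for reproducibility.

```python
import numpy as np

def fit_plane(P, d):
    """Least-squares d-dimensional affine plane through the rows of P."""
    mu = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - mu, full_matrices=False)
    return mu, Vt[:d].T                      # center and orthonormal basis (D x d)

def plane_residuals(X, mu, Phi):
    """(x - mu)^T (I - Phi Phi^T) (x - mu) for each row x of X."""
    R = X - mu
    R = R - (R @ Phi) @ Phi.T                # remove the in-plane component
    return np.sum(R ** 2, axis=1)

def k_planes(X, dims, labels=None, n_iter=100, seed=0):
    """Alternate plane fitting and nearest-plane assignment until stable."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    k = len(dims)
    if labels is None:
        labels = rng.integers(0, k, size=len(X))        # random initial partition
    labels = np.asarray(labels).copy()
    planes = [None] * k
    for _ in range(n_iter):
        for j in range(k):                              # cluster update
            members = X[labels == j]
            if len(members) > 0:
                planes[j] = fit_plane(members, dims[j])
            elif planes[j] is None:                     # empty at start: seed with one point
                planes[j] = fit_plane(X[[rng.integers(len(X))]], dims[j])
        errs = np.stack([plane_residuals(X, mu, Phi) for mu, Phi in planes], axis=1)
        new_labels = errs.argmin(axis=1)                # cluster assignment
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, planes
```

Like K-means, this procedure only reaches a local optimum, so in practice several random initializations are usually tried and the one with the smallest total residual is kept.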

As indicated before, the undirected graph constructed for each intersecting subset
may connect different manifolds, making it impractical to partition the samples into
the manifolds they belong to. To reveal the different manifolds, the connections between
them should be cut, while those within the same manifold should be preserved. Since the
unfaithful connections mainly come from the different fine clusters, we cut the
connections among them, while connecting all the points in the same fine cluster to
preserve the manifold structure. Finally, a new undirected graph $G_{new}$ is obtained for
each intersecting subset $\Omega^{is}$, and spectral clustering is used to group the points in
each $\Omega^{is}$ into different manifold clusters.
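The rewiring described above amounts to a small edit of the weight matrix. The sketch below is illustrative only: the function name `rewire_graph` and the encoding of fine clusters as integer labels (with -1 for points outside every intersection cluster, and label values assumed unique across intersection clusters) are assumptions, not part of the original method description.

```python
import numpy as np

def rewire_graph(W, fine_labels):
    """Return a new weight matrix with unfaithful edges removed.

    W: (N, N) symmetric 0/1 weight matrix of an intersecting subset.
    fine_labels: length-N integer array; fine-cluster id for points in an
    intersection cluster, -1 for all other points (assumed encoding).
    """
    W_new = np.array(W, dtype=float)                    # work on a copy
    fine_labels = np.asarray(fine_labels)
    labelled = fine_labels >= 0
    both = labelled[:, None] & labelled[None, :]        # both endpoints in fine clusters
    same = fine_labels[:, None] == fine_labels[None, :]
    W_new[both & ~same] = 0.0                           # cut edges across fine clusters
    W_new[both & same] = 1.0                            # fully connect each fine cluster
    np.fill_diagonal(W_new, 0.0)                        # no self-loops
    return W_new
```

Edges that touch a point outside the intersection clusters are left unchanged, so the non-intersection areas keep their original neighborhood structure.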

3.2  To  Determine  the  Number  of  Clusters  and  the  Intrinsic  Dimensions 

So far, we have shown our scheme to partition the given samples into the manifolds
they belong to. However, it relies on a given number of clusters and their intrinsic
dimensions, and how to adaptively determine these parameters has not been resolved.
In the following, we propose to use the eigengap, a local intrinsic dimension estimator
and a new bottom-up search procedure to address these issues.

First, as demonstrated in [14], the number of connected components $r$ in the adopted
spectral clustering equals the multiplicity of the eigenvalue zero of the generalized
eigenproblem. Therefore, $r$ can be determined by using the eigengap heuristic. That is,

if $|\lambda_l - \lambda_{l-1}| < 10^{-6} < |\lambda_{l+1} - \lambda_l|$, then $r = l$, \qquad (3)

where $10^{-6}$ is used in place of zero to avoid numerical problems.
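Read operationally, Eq. (3) counts how many of the smallest eigenvalues are numerically zero before the first noticeable gap. A minimal sketch, in which the function name `estimate_num_components` is hypothetical and the eigenvalues are assumed to come from the generalized eigenproblem of the adopted spectral clustering:

```python
def estimate_num_components(eigenvalues, tol=1e-6):
    """Return r, the index of the first eigengap larger than tol.

    The eigenvalues are sorted in ascending order; r is the number of
    leading eigenvalues whose consecutive differences stay within tol.
    """
    lam = sorted(eigenvalues)
    for l in range(1, len(lam)):
        if abs(lam[l] - lam[l - 1]) > tol:   # gap between the l-th and (l+1)-th eigenvalue
            return l
    return len(lam)                          # no gap found: all values are tied
```

For example, eigenvalues of the form (0, 1e-12, 3e-9, 0.4, 0.9) indicate three connected components under this rule.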

The intrinsic dimension $id$ of each point can be estimated by using a local dimension
estimator. It is based on the observation that though the manifold structures are
globally nonlinear, they are locally linear [3]. Moreover, it is known that when the
original data are sampled from an $id$-dimensional manifold, the $id$ largest
eigenvalues of the covariance matrix are significantly higher than the others and thus
can be used to estimate the intrinsic dimension [16]. In more detail, we estimate the
intrinsic dimension as follows:

1. Calculate the local covariance matrix: For each point $x_i$, find its $L$ nearest
neighbors $x_i^1, \ldots, x_i^L$, then calculate the local covariance matrix

$$C_i = \frac{1}{L} \sum_{j=1}^{L} (x_i^j - \mu_i)(x_i^j - \mu_i)^T, \qquad (4)$$

where $\mu_i = \frac{1}{L} \sum_{j=1}^{L} x_i^j$ is the mean vector.

2. Intrinsic dimension estimation: Determine the sorted eigenvalues $\lambda_1 \geq \cdots \geq \lambda_D$
of $C_i$;

if $\lambda_j / \lambda_1 < 0.05 < \lambda_{j-1} / \lambda_1$, then $id = j - 1$. \qquad (5)
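The two steps can be sketched with local PCA as follows. This is an illustrative implementation under assumptions: the function name `local_intrinsic_dimension` is hypothetical, neighbors are found by brute-force Euclidean distances, and a point whose eigenvalue ratios never fall below the threshold is assigned the ambient dimension D.

```python
import numpy as np

def local_intrinsic_dimension(X, L, ratio=0.05):
    """Estimate the intrinsic dimension at every row of X from its L nearest neighbors."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)    # squared distances
    np.fill_diagonal(d2, np.inf)                        # a point is not its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :L]
    dims = np.empty(N, dtype=int)
    for i in range(N):
        nbrs = X[nearest[i]]
        centered = nbrs - nbrs.mean(axis=0)
        C = centered.T @ centered / L                   # local covariance matrix
        lam = np.clip(np.sort(np.linalg.eigvalsh(C))[::-1], 0.0, None)
        if lam[0] == 0.0:                               # all neighbors coincide
            dims[i] = 0
            continue
        small = np.nonzero(lam / lam[0] < ratio)[0]     # first j with ratio below threshold
        dims[i] = small[0] if small.size else D         # zero-based index equals j - 1
    return dims
```

Points near an intersection of manifolds tend to receive a larger estimate than points elsewhere, which is the property the method relies on to separate single from intersecting subsets.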

More challenging is to determine $k^*$ and $d_1^*, d_2^*, \ldots, d_{k^*}^*$ in the K-planes algorithm,
which is used to reveal fine clusters in each intersection cluster. Our solution is based
on a bottom-up search strategy, which starts from the lowest dimension $d_{\min}$. Moreover,
we can determine the possible dimensions and the number of clusters, which reduces the
search space. First, let us introduce the following notion.

Definition: Effective Dimension (ED) [17]

Given $k$ subspaces $\Omega = \bigcup_{i=1}^{k} \Omega_i \subset \mathbb{R}^D$ of dimension $d_i < D$, and $N_i$ sample points
$X_i = \{x_i^j, j = 1, \ldots, N_i\}$ drawn from each subspace $\Omega_i$, the effective dimension is
defined to be:

$$ED(X, \Omega) = \frac{1}{N} \sum_{i=1}^{k} d_i (D - d_i) + \frac{1}{N} \sum_{i=1}^{k} N_i d_i, \qquad (6)$$

where $N = \sum_{i=1}^{k} N_i$ is the total number of samples.

The effective dimension $ED(X, \Omega)$ is the "average" number of real numbers needed per
sample of $X$. Generally, there could be many manifold structures $\Omega$ which can fit $X$,
while the manifold structure that leads to the minimum ED normally corresponds to an
"efficient" and hence "natural" interpretation of the data; see [17]. Formally, ED is low
if the number of clusters and the dimension of each cluster are small. Therefore, to
faithfully fit the underlying manifold structure, we should search for the structure
which minimizes ED among all possible structures under a certain criterion. To be
consistent with the K-planes algorithm, the reconstruction error is a good choice.
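As a worked illustration of the effective dimension, the following sketch evaluates the definition directly. The function name `effective_dimension` and the example cluster sizes are hypothetical; only the formula itself is taken from the definition above.

```python
def effective_dimension(dims, counts, D):
    """Effective dimension of a structure.

    dims: dimension d_i of each cluster.
    counts: number of samples N_i assigned to each cluster.
    D: ambient dimension.
    """
    N = sum(counts)
    model_cost = sum(d * (D - d) for d in dims)              # cost of describing the subspaces
    data_cost = sum(n * d for n, d in zip(counts, dims))     # coordinates inside the subspaces
    return (model_cost + data_cost) / N

# Two planes in 3-D with 100 samples each: (2 + 2 + 200 + 200) / 200 = 2.02
ed_two_planes = effective_dimension([2, 2], [100, 100], D=3)
```

Adding a third plane with the same total number of samples raises the first term while leaving the second unchanged, so the structure with fewer clusters is preferred whenever it fits the data equally well.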
mumCluster(X, L, ε, e_max)

Input:
    X: D × N feature matrix
    L: number of nearest neighbors
    ε: threshold for determining the intersection area
    e_max: maximum error threshold

Process:
    Construct graph G with weighted matrix W
    Group using spectral clustering on W with eigengap
    for each connected subset
        Compute the intrinsic dimension id for each point
        if id's are the same
            Output this connected subset as a cluster
        else
            Construct a new graph G_new
            Group using spectral clustering on G_new
        endif
    end

Output:
    {C_1, C_2, ..., C_k}: the results of clustering

Fig. 2. Pseudo-code of the mumCluster method

To reduce the search space, the following observation is considered: the intersection
clusters are crossed by different manifolds, moving continuously from the
non-intersection clusters. Suppose an intersection cluster is connected with $m$
non-intersection clusters; then the dimensions of the non-intersection clusters imply
the possible dimensions of the fine clusters, while the number of non-intersection
clusters limits the number of fine clusters.

Our  bottom-up  strategy  can  be  summarized  as  follows: 

1. For each intersection cluster, determine the number of connected non-intersection
clusters (i.e., $m$) and the dimension of each non-intersection cluster (i.e., $d_1, \ldots, d_m$);

2. Suppose there are $n$ different sorted numbers in $\{d_1, \ldots, d_m\}$, i.e., $d^1 < \cdots < d^n$.
Let the possible number of clusters range from $n$ to $m$. For each specified
number, the dimension of each cluster is given by one number in $\{d^1, \ldots, d^n\}$, starting
from the lowest to the highest, and at least one cluster has dimension $d^j$, $j = 1, \ldots, n$.

3. For each given number and dimensions of the clusters, compute its ED if the
reconstruction error by K-planes is smaller than a specified maximum error $e_{\max}$.
Otherwise, ED is set to the maximum number $N_{\max} = 100$.

4. The best number of clusters and their dimensions are given by the structure with
the minimum ED.
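The bottom-up search can be sketched as an enumeration followed by a minimum-ED selection. This is an illustration under assumptions, not the authors' code: the names `candidate_structures` and `select_structure` are hypothetical, and `fit` stands for any user-supplied routine (for example a K-planes run) that returns the reconstruction error and the resulting cluster sizes for a given structure.

```python
from itertools import combinations_with_replacement

def effective_dimension(dims, counts, D):
    """Effective dimension for clusters with dimensions `dims` and sizes `counts`."""
    N = sum(counts)
    return (sum(d * (D - d) for d in dims)
            + sum(n * d for n, d in zip(counts, dims))) / N

def candidate_structures(neighbor_dims):
    """Enumerate the structures allowed by steps 1 and 2.

    neighbor_dims: dimensions of the m connected non-intersection clusters.
    Every distinct dimension must be used by at least one fine cluster.
    """
    distinct = sorted(set(neighbor_dims))
    n, m = len(distinct), len(neighbor_dims)
    out = []
    for k in range(n, m + 1):                            # fewest clusters first
        for combo in combinations_with_replacement(distinct, k):
            if set(combo) == set(distinct):
                out.append(combo)
    return out

def select_structure(neighbor_dims, fit, D, e_max, ed_max=100.0):
    """Steps 3 and 4: return the structure with minimum ED and that ED.

    fit(structure) is a hypothetical callable returning
    (reconstruction_error, cluster_sizes).
    """
    best, best_ed = None, float("inf")
    for structure in candidate_structures(neighbor_dims):
        err, counts = fit(structure)
        ed = effective_dimension(structure, counts, D) if err < e_max else ed_max
        if ed < best_ed:                                 # ties keep the earlier, smaller structure
            best, best_ed = structure, ed
    return best, best_ed
```

With four connected non-intersection clusters that are all two-dimensional, `candidate_structures([2, 2, 2, 2])` yields (2), (2,2), (2,2,2) and (2,2,2,2), which are the columns considered in Table 1.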

Our proposed mumCluster reveals that there are three intersection clusters for the points
sampled from Figure 1, where each cluster is connected with $m = 4$ non-intersection
clusters. The possible structures (in the form of $(d_1^*, \ldots, d_{k^*}^*)$ for $k^*$ clusters) and
their corresponding effective dimensions are tabulated in Table 1.

Table 1. Effective dimension (ED) for each intersection cluster in Figure 1 w.r.t. the possible
structure (the best is marked with *)

Structure                 (2)    (2,2)     (2,2,2)   (2,2,2,2)
Intersection cluster 1    100    2.021*    2.031     2.041
Intersection cluster 2    100    2.019*    2.029     2.039
Intersection cluster 3    100    2.020*    2.030     2.040

Figure 2 shows the pseudo-code of mumCluster.

3.3  Complexity  Analysis 

The computational complexity of our proposed mumCluster is dominated by three
parts: intrinsic dimension estimation, connected components search and fine clusters
identification. The intrinsic dimensions of $N$ $D$-dimensional data points are estimated
by performing local PCA on the $L$ nearest neighbors of each point, so the complexity is
$N \times O(LD \min(L, D))$. Spectral clustering is used to search for the $r$ connected
components, with total complexity $O((D + L + r)N^2 + Nr^2 t)$, where $O((D + L)N^2)$
is the time complexity of constructing the similarity graph, $O(rN^2)$ is the
complexity of computing the first $r$ generalized eigenvectors, and $O(Nr^2 t)$ is the
complexity of K-means in $r$-dimensional space for $t$ iterations. Since $r \ll N$, $L \ll N$ and
K-means converges very quickly, the complexity of connected components search is
bounded by $O(N^2 \max(D, k))$. The complexity analysis of grouping fine clusters using
K-planes is not straightforward, since we do not know the exact number of points to be
grouped and a bottom-up scheme as shown in Subsection 3.2 is needed to automatically
determine the number of clusters and their dimensions. However, following the same
analysis as in [8], the overall worst-case time complexity (an upper bound) of this
procedure is $O(m^2) \cdot O(DN \min(D, N))$ when there are $m$ non-intersection clusters. Note
that this bound does not reflect the real running time, as demonstrated by the
experiments presented in the next section. To sum up, the computational complexity of
mumCluster is bounded by $O(N^2 \max(D, N))$ in total, which is determined by the number
of data points and the number of features.

4  Experiments 

We now evaluate the performance of mumCluster using synthetic data and real data.
Note that the number of manifold clusters and their dimensions are provided to all the
other manifold clustering methods, but not to mumCluster. For spectral clustering (SC),
the unsymmetrical normalized spectral clustering [10] is used.

4.1  Hybrid  Modeling  Data 

The hybrid modeling data shown in Figure 1 are drawn from one helix, one swiss-roll,
and one two-dimensional surface in $\mathbb{R}^3$. The numbers of points are 200, 1000 and 600,
respectively. As we can see from Figure 3, none of the other methods works well on
this data set. Table 2 reports the clustering accuracy of the different methods. Our
method reaches 99.06%, while the best competing method reaches 60.17%. GPCA and
K-planes do not work well in this nonlinear case because of their linear nature, while
the method of Cao and Haralick treats the intersections as clusters. SC diffuses wrong
clustering information across the intersecting manifolds, while K-manifolds fails to
estimate faithful geodesic distances when there are separated clusters.

Fig. 3. Grouping results using different manifold clustering methods

Table 2. Clustering accuracy (%) of the different methods on the hybrid modeling data

GPCA     K-planes    SC       K-manifolds    Cao-Haralick    MumCluster
38.11    40.06       57.39    40.39          60.17           99.06

4.2  Single  Modeling  Data 

It is interesting to compare mumCluster with SC on data containing multiple single
manifolds, and with K-manifolds on data containing intersecting manifolds,
where SC and K-manifolds can work well, respectively. It is easy to see that when
points are sampled from multiple separated single manifolds, mumCluster is in
fact the same as SC, and therefore those results are not presented here due to the space
limit. In the following, we compare mumCluster with K-manifolds on data containing
intersecting manifolds. The spirals data set¹ (see Figure 1 of [4]), on which K-manifolds
works well, is used for the comparison. We run mumCluster and K-manifolds over
five random samplings from this data set, as well as the other methods which
can be used for intersecting manifolds. Table 3 reports the clustering accuracy. The
results show that mumCluster outperforms the other methods on all five samplings.

http://www.cs.wustl.edu/  rnis2/kmanifolds.htm 
Table 3. Clustering accuracy (%) over five random samplings from the spirals data set

Data set        A        B       C        D       E
GPCA            48.8     42.4    43.6     44.8    47.0
K-planes        48.2     40.6    49.4     46.4    46.4
Cao-Haralick    52.0     50.6    47.6     51.0    48.4
K-manifolds     98.0     96.0    97.6     97.6    96.6
MumCluster      100.0    99.8    100.0    99.6    99.2

4.3  Illumination  Variant  Face  Clustering 

In this experiment, the face images in the Yale Face Database B² [18] under 64 varying
lighting conditions are used. We strictly follow the experimental design of [5] for a fair
comparison; that is, subjects 2, 5, and 8 of this database are used and the original data
are projected onto a low-dimensional space (here, the LLE [3] method is adopted) before
manifold clustering. For the purpose of visualization, we use the class information to
label the samples as shown in Figure 4 (a), which is used as the ground truth for
comparing the different approaches. Note that the class information of the samples is
not provided to the clustering methods. We apply mumCluster and the other methods
to group the data. As can be seen from Figure 4, our proposed method achieves a better
clustering, with a clustering accuracy of 86.98%, while the clustering accuracies
of the other methods are 77.08%, 80.21%, 65.10%, 51.04% and 56.25%, respectively. The
total running time of mumCluster on this real-world data is 0.64 s, where local intrinsic
dimension estimation costs 0.07 s and fine clusters identification costs 0.31 s.


[Figure 4 panels include (c) K-manifolds, (f) Cao-Haralick, (g) MumCluster and (h) Accuracy (%). Legend: cluster 1, cluster 2, cluster 3.]

Fig. 4. Clustering results using different methods on a subset of the Yale Face Database B


4.4  The  Influence  of  Parameters 

There  are  three  parameters  in  mumCluster.  In  this  subsection,  we  examine  their  impact 
on  the  performance  of  mumCluster  by  fixing  two  parameters  and  varying  the  concerned 

http://www.cs.uiuc.edu/hoines/dengcai2/Data/FaceData.html 
[Figure 5 panels: (a) Influence of L on Spirals, (b) Influence of ε on Spirals, (c) Influence of e_max on Spirals, (d) Influence of L on Yale, (e) Influence of ε on Yale, (f) Influence of e_max on Yale.]

Fig. 5. Influence of parameters on mumCluster


parameter. The results on the spirals data set A and the Yale Face Database B are plotted
in Figure 5. We have studied many other data sets; the results are similar and
thus omitted due to the page limit. In general, the optimal values of these parameters
depend on the distribution of the samples, but mumCluster achieves
good performance over a broad range of these parameters. In detail, the performance of
mumCluster is generally insensitive to the setting of $L$, as long as it is neither too small
nor too large. The reason is that $L$ is the number of nearest neighbors: when it is too
small, the neighborhoods do not capture enough structural information and may lead to
many disconnected subgraphs, while the local property is lost when it is too large.
Moreover, the results on the Yale data fluctuate more than those on the synthetic
data, which reflects the complexity of real-world data, so more attention should
be paid to parameter setting there. The performance of mumCluster degenerates when
$\varepsilon$ is large. The reason is that $\varepsilon$ controls the enlarged area around the intersection
points, and a large value makes this area too large to be locally linear. MumCluster is
relatively insensitive to the setting of $e_{\max}$, as we can see in Figure 5 (c) and (f).

Overall, Figure 5 shows that setting the parameters of mumCluster is not difficult,
since the performance of mumCluster is robust over a broad range of parameter values.
Moreover, among the three parameters, $L$ has the most influence on the performance of
mumCluster, which shows that local intrinsic dimension estimation is a key step in our
scheme. A more sophisticated intrinsic dimension estimator could be incorporated
into mumCluster to improve the performance, which is our ongoing work.


5  Conclusion 

In this paper, we propose a new manifold clustering method, mumCluster, which
works well when the samples are drawn from hybrid modeling and can adaptively
determine the number of clusters and the intrinsic dimensions. Experimental results
show that mumCluster outperforms GPCA, K-planes, spectral clustering, K-manifolds and
the method of Cao and Haralick on the evaluated data sets.

Abstract. This paper presents an approach to integrate word clustering information
into the process of unsupervised feature selection. In our scheme, the words
in the whole feature space are clustered into groups based on the co-occurrence
statistics of words. The resulting word clustering information and the bag-of-word
information are combined to measure the goodness of each word, which
is our basic metric for selecting discriminative features. By exploiting word cluster
information, we extend three well-known unsupervised feature selection methods
and propose three new methods. A series of experiments is performed on
three benchmark text data sets (the 20 Newsgroups, Reuters-21578 and CLASSIC3).
The experimental results show that the new unsupervised feature
selection methods can select more discriminative features, and in turn improve
the clustering performance.


1  Introduction 

Feature selection is the process of selecting a feature subspace from the original feature
space according to some defined criteria. Depending on whether class label information is
required, feature selection methods can be classified into two categories: the
supervised approach and the unsupervised approach. Supervised methods rely on
the correlation between features and class labels. Unsupervised
methods do not need class label information; the goodness of each feature
is computed according to its own representation. Complete reviews can be found in
[9][13].

The most popular unsupervised feature selection methods usually employ the well-known
bag-of-word representation to select the feature subspace. In these methods, features
are represented with respect to distinct words in the corpus and treated as independent
word vectors in the vector space model. Feature goodness is affected by the frequency
value of the word vector: the higher the frequency, the larger the feature goodness.
However, in a high-dimensional corpus, there is usually a large portion of low-frequency
words that are informative to each other. The contribution of these words to
document clustering is significant. If we simply select features without considering the
correlation between words, there is a high chance that these low-frequency words with
high discriminative capability will be missed. As a result, the average discriminative
capability of the selected feature subspace decreases.

In this paper, we propose an approach to integrating word clustering information into
unsupervised feature selection methods. In our scheme, we define a similarity
measure based on the co-occurrence of words. After calculating the
similarity of all word pairs, we cluster distinct words into groups, and blend
the resulting word clustering information with the bag-of-word information to measure
the goodness of each feature. This method increases the chance of including
low-frequency features because the defined co-occurrence similarity measure is biased towards
low-frequency words. The basic idea of this approach can be explained intuitively as
follows. Consider clustering documents about sports into clusters, where each
cluster corresponds to an individual sport category (e.g., basketball, football, and baseball).
The common word "teamwork", related to all three categories, may occur frequently in
the whole corpus, whereas the discriminative words "dunk" and "layup", which are
related only to the basketball category, may occur only in the documents about basketball.
To select a feature subspace from a feature space in which the discriminative
words are less frequent than the common words, we cluster the words into groups.
There is a high chance that the discriminative words that occur only in the documents about
basketball are clustered into one group, because they are likely to co-occur.
Furthermore, the word clustering algorithm can sensibly cluster the common words.
Thus the new feature selection methods that integrate word-cluster information can give
a more robust estimate of the goodness of the low-frequency discriminative words.

Our  contributions  in  this  paper  are: 

- First, we cluster words into groups specifically for the benefit of unsupervised
feature selection. While much study has been devoted to word clustering
for text categorization and text clustering [2][3][6][11], little work has been done
on word clustering for unsupervised feature selection. Word clustering has the
following advantages over the simple bag-of-word representation. It provides an implicit
description of the semantic correlations between words. It
also offers a solution to the sparsity and high dimensionality of
text data by generating a reduced, compact space. It should be mentioned, however,
that directly using word clusters as features for document clustering, as
in [11][2], suffers a reduction in performance if the word clusters that compose
the feature space are imbalanced and impure. Indeed, to date the best results for
the well-known Reuters-21578 and 20 Newsgroups data sets both use words as
features [10][12]. Consequently, it is a natural choice to use the word clustering
information in the process of feature selection and select discriminative words to
form the feature subspace, rather than directly using the word clusters as features to
represent documents in the corpus.

- Second, by exploiting word cluster information, we extend three well-known
unsupervised feature selection methods and propose three new methods. To compare
the text clustering performance of features selected by the new methods and by the
original methods, we conduct a series of comparative experiments on three benchmark
data sets, i.e., Reuters-21578, 20 Newsgroups, and CLASSIC3. The results
show that the new methods select better features, yielding higher document
clustering performance than the original methods.

The rest of this paper is organized as follows. In Section 2 we describe the word
clustering algorithm and the similarity measure. Section 3 presents the proposed new
unsupervised feature selection methods. Section 4 describes the data sets and the evaluation
methods used in our experiments. Section 5 gives a detailed analysis of the experimental
results. Finally, we conclude this paper in Section 6.

2  Word  Clustering 

Data clustering is a challenging field of data mining research in which each potential
application poses its own special requirements [?]. Clustering groups
the data into clusters so that objects within the same cluster are more similar to each
other than to objects in other clusters. Often, the clustering performance is influenced by the clustering
algorithm and the similarity measure. The choice of clustering algorithm and similarity
measure must suit the target application. Without an appropriate clustering
algorithm and similarity measure, the clustering results can be useless or meaningless.

For the word clustering task, there are two typical requirements. First, the text data set is
sparse and high-dimensional, so the word clustering algorithm should be good at finding
clusters in a high-dimensional sparse space. Second, the clustering algorithm is required
to run efficiently in real-world applications. Some clustering algorithms may work well
on high-dimensional sparse data sets, but they are too time-consuming or
require users to input certain parameter values, such as the number of clusters. These
constraints make them difficult to use.

The single-linkage algorithm is an efficient clustering method that can provide a
solution to the word clustering task. It is a bottom-up agglomerative method that groups data
into a tree of clusters and terminates when the distance between the two nearest clusters
exceeds a certain threshold. Initially, the single-linkage algorithm places each object into
an individual cluster of its own. The clusters are then merged step by step according to
some defined similarity measure. Each cluster is represented by all of the objects in the
cluster, and the similarity between two clusters is measured by the similarity of the closest
pair of data objects belonging to the two clusters [8]. In clustering the objects, a
predetermined minimal similarity threshold serves as the halting criterion.
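
The merging procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `sim` is a hypothetical caller-supplied symmetric similarity function, and the sketch relies on the fact that single-linkage merging halted at a similarity threshold yields the connected components of the graph that links every pair whose similarity is at or above the threshold.

```python
from itertools import combinations


def single_linkage(items, sim, threshold=0.5):
    """Group items by single linkage, halting below `threshold`.

    `sim` is a hypothetical, caller-supplied symmetric similarity function.
    """
    parent = list(range(len(items)))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the cluster representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merging the clusters of every pair at or above the threshold gives the
    # same partition as step-by-step single-linkage merging with that halt.
    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())


# Toy usage: two numbers are "similar" when they differ by at most 1.
groups = single_linkage([1, 2, 3, 10, 11, 20],
                        lambda a, b: 1.0 if abs(a - b) <= 1 else 0.0)
```

The sketch evaluates every pair of items, so its cost is quadratic in the number of items; it needs no cluster count as input, only the threshold.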

For word clustering, a measure of the similarity between words is required.
Often, similarities are assessed based on the word vector values in the vector space
model. Our raw knowledge about the value of a word is its frequencies in documents.
More generally, we can represent the word vector by its frequencies over the documents in
the training corpus, i.e., $t = (w(t, d_1), \ldots, w(t, d_{|D|}))$, where $w(t, d)$ is the frequency of
term $t$ in document $d$. For computational reasons, in what follows we consider only the
presence or absence of a word in a document, that is:


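$$ w(t, d) = \begin{cases} 1 & \text{if term } t \text{ occurs in document } d \\ 0 & \text{otherwise} \end{cases} \tag{1} $$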


We define the similarity between two word vectors as follows:

$$ S(t_i, t_j) = \min\left( \frac{\cdots}{\|t_i\|},\ \frac{\cdots}{\|t_j\|} \right) \tag{2} $$

This similarity measure is a natural choice for the word clustering task because of its
simplicity and scalability. Its value lies between 0 and 1;
it is zero only when $t_i$ and $t_j$ are independent. The ratio increases as the co-occurrence
between two word vectors increases, and is bounded by 1. We can thus use the observed
co-occurrence of two words to measure how likely they are to be instances of the
same cluster. Since the similarity lies between 0 and 1, it is
straightforward to specify a similarity threshold that terminates the
clustering process. We can set the similarity threshold to 0.5. Since the purpose
of the word clustering algorithm is to provide additional implicit correlation
information between words, the performance of feature selection is not sensitive to
the similarity threshold.
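
Equation (2) is only partly legible in this text, so the following Python sketch rests on an explicit assumption: each argument of the min is taken to be the number of documents that contain both words divided by the number of documents that contain one of them. The function name and the toy corpus are hypothetical. Under this reading the value lies in [0, 1], is 0 when the words never co-occur, and reaches 1 only when both words occur in exactly the same documents.

```python
def cooccurrence_similarity(docs_i, docs_j):
    """Assumed reading of Eq. (2): min(|Di & Dj| / |Di|, |Di & Dj| / |Dj|).

    docs_i and docs_j are the sets of ids of the documents containing each
    word, i.e. the binary presence/absence representation of Eq. (1).
    """
    if not docs_i or not docs_j:
        return 0.0
    shared = len(docs_i & docs_j)
    return min(shared / len(docs_i), shared / len(docs_j))


# Toy corpus (hypothetical): word -> ids of the documents in which it occurs.
occurrences = {
    "dunk": {1, 2},
    "layup": {1, 2, 3},
    "teamwork": {1, 2, 3, 4, 5, 6, 7, 8},
}
s_rare = cooccurrence_similarity(occurrences["dunk"], occurrences["layup"])
s_common = cooccurrence_similarity(occurrences["dunk"], occurrences["teamwork"])
```

With a threshold of 0.5, this assumed measure would link "dunk" and "layup" but keep the frequent word "teamwork" apart from both, which matches the bias towards low-frequency words described in the introduction.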

3  Word  Clusters  for  Unsupervised  Feature  Selection 

After clustering all words in the corpus, an additional step that exploits the
word cluster information is added before selecting features. We then blend the word
cluster information with the bag-of-word information to measure the goodness of
individual features. With this hybrid method, we extend three well-known unsupervised
feature selection methods, i.e., Document Frequency, Term Contribution and Term
Quality, and propose three new methods, called the word-cluster approach, in which both word
and word cluster information are included.


3.1 Word Clusters for Document Frequency (DF and wc_DF)

Document frequency (DF) as a feature selection criterion for a term $t$ can be described
as follows: $DF(t) = |D_t|$, where $|D_t|$ is the number of documents in which term $t$
occurs. The higher the document frequency, the better the feature.

The DF feature selection method can be used for both supervised document
categorization [13] and unsupervised document clustering [9]. This method assumes that
the contribution of low-frequency words is insignificant. An improvement in performance
is also possible if the low-frequency terms happen to be noise. However, low-frequency
words may carry useful discriminative information when clustering high-dimensional
data with many classes. This idea is consistent with the popular inverse
document frequency weighting scheme in the area of information retrieval.

To extend the DF feature selection, we take the word cluster size into account in the
new word-cluster DF method. Formally, the word-cluster DF criterion for term $t$ is
defined as follows:

$$ \mathrm{wc\_DF}(t) = |D_t| \times (1 + \log|C_t|) \tag{3} $$

where $|C_t|$ is the size of the cluster in which term $t$ is included. Here, $|D_t|$ refers to the
importance in the document aspect, and $|C_t|$ refers to the importance in the word aspect. This
method indicates that the importance of a term $t$ can be improved by increasing the
number of documents containing the term or the number of words correlated with the
term.
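
As an illustration of Eq. (3), the following Python sketch scores terms by DF and by wc_DF. The toy document sets and cluster sizes are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated.

```python
import math


def df(term, doc_sets):
    """DF(t) = |D_t|, the number of documents containing the term."""
    return len(doc_sets[term])


def wc_df(term, doc_sets, cluster_size):
    """wc_DF(t) = |D_t| * (1 + log|C_t|), Eq. (3); natural log is assumed."""
    return df(term, doc_sets) * (1.0 + math.log(cluster_size[term]))


# Hypothetical toy data: document sets and word-cluster sizes per term.
doc_sets = {"dunk": {1, 2}, "layup": {1, 2, 3}, "the": {1, 2, 3, 4}}
cluster_size = {"dunk": 2, "layup": 2, "the": 1}

scores = {t: wc_df(t, doc_sets, cluster_size) for t in doc_sets}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy example plain DF ranks "the" first, whereas wc_DF ranks "layup" above it, because "layup" belongs to a two-word cluster while "the" forms a singleton cluster whose factor is 1.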

3.2 Word Clusters for Term Contribution (TC and wc_TC)

Term contribution was proposed by Liu et al. [9]. It is a criterion that measures the
contribution of a term to discriminating the documents in the data set. In this method, the contribution
of a term is equivalent to its contribution to the similarity of all document pairs in the
corpus. Formally, the contribution of a term $t$ to the similarity of a document pair $d_i$
and $d_j$ is defined as follows:

$$ TC(t, d_i, d_j) = w(t, d_i) \times w(t, d_j) \tag{4} $$

The contribution of term $t$ to all document pairs in the corpus is defined as follows:

$$ TC(t) = \sum_{i,j,\; i \neq j} w(t, d_i) \times w(t, d_j) \tag{5} $$

where $w(t, d_i)$ and $w(t, d_j)$ are the weights of term $t$ in documents $d_i$ and $d_j$,
respectively. Often, the tf-idf weight is used. Formally, the tf-idf value of term $t$
in document $d$ is computed as follows:

$$ w(t, d) = \frac{tf_{td}}{\sum_{t'} tf_{t'd}} \times \log\frac{N}{|D_t|} \tag{6} $$

where $tf_{td}$ is the term frequency of term $t$ in document $d$; $\sum_{t'} tf_{t'd}$ is the sum of all term
frequencies in document $d$, used to normalize $tf_{td}$ and prevent a bias towards longer
documents; $N$ is the total number of documents in the corpus; and $|D_t|$ is the number
of documents in which term $t$ occurs, so that $N/|D_t|$ is the inverse document
frequency of term $t$.

To extend the TC feature selection method, we propose a new tf-idf type weighting
scheme in which both word and word cluster information are included. Formally, the
word-cluster tf-idf value of a term $t$ is defined as follows:

$$ w'(t, d) = \frac{tf_{td}}{\sum_{t'} tf_{t'd}} \times \log\frac{N}{|D_t|} \times (1 + \log|C_t|) \tag{7} $$

where $|C_t|$ is the size of the cluster in which term $t$ is included. A high value in the new
weighting scheme corresponds to a high term frequency, a low document frequency and
a significant word cluster in the corpus.

The word-cluster TC criterion is then obtained by directly applying the new
tf-idf weighting scheme to the original word-solely TC criterion. It is given by:

$$ \mathrm{wc\_TC}(t) = \sum_{i,j,\; i \neq j} w'(t, d_i) \times w'(t, d_j) \tag{8} $$
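
The TC and wc_TC criteria can be sketched in Python as follows. The sketch assumes that the sums run over ordered document pairs with i ≠ j and that the logarithm is natural; the function names and the toy numbers are hypothetical. It uses the identity that the sum of w_i w_j over all i ≠ j equals (Σ w_i)² − Σ w_i², which avoids an explicit loop over document pairs.

```python
import math


def tfidf(tf, doc_len, n_docs, df):
    """Eq. (6): length-normalised term frequency times log(N / |D_t|)."""
    return (tf / doc_len) * math.log(n_docs / df)


def term_contribution(weights):
    """Sum of w_i * w_j over ordered pairs i != j of documents.

    Uses (sum w)^2 - sum w^2, which avoids the explicit double loop.
    """
    total = sum(weights)
    return total * total - sum(w * w for w in weights)


def wc_term_contribution(weights, cluster_size):
    """Eq. (8): TC on weights scaled by (1 + log|C_t|), as in Eq. (7)."""
    boost = 1.0 + math.log(cluster_size)
    return term_contribution([w * boost for w in weights])


# Hypothetical term: occurs in 2 of 4 documents, each 10 tokens long.
weights = [tfidf(2, 10, 4, 2), tfidf(1, 10, 4, 2), 0.0, 0.0]
tc = term_contribution(weights)
wc_tc = wc_term_contribution(weights, cluster_size=3)
```

Because the cluster factor is constant for a given term, wc_TC equals TC multiplied by the square of (1 + log|C_t|) under these assumptions.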

3.3 Word Clusters for Term Quality (Q0, Q1 and wc_Q)

Term quality was proposed and evaluated by Dhillon [5]. It is a criterion that measures the
goodness of a target term via its distribution. Consider two variables, $t$ and
$D$, corresponding to the target term and the documents in the corpus. Our knowledge of
the correlation between them is the term frequency $tf_{td}$ of the pair $(t, d)$ in
the corpus. With this knowledge, we define the distribution of a term $t$ over documents
$D$ to be $p_t(d) = tf_{td}$, where $tf_{td}$ is the term frequency of term $t$ in document $d$; we
refer to this distribution as the document distribution.

To evaluate the goodness of a feature according to its document distribution, Dhillon
defined the distribution variance as a measure of the discriminative capability of a
distribution. The variance of the document distribution $p_t$ over the documents in the corpus is
given by:

$$ Var(p_t, D) = E[p_t^2] - E[p_t]^2 = \frac{\sum_{d \in D} tf_{td}^2}{|D|} - \left[ \frac{1}{|D|} \sum_{d \in D} tf_{td} \right]^2 \tag{9} $$


where  E[pt]  is  the  expected  value  (mean)  of  distribution  pt. 

Dhillon applied the variance formula to feature selection and defined a similar
criterion called term quality $Q_0$:

$$ Q_0(t) = |D| \times Var(p_t, D) = \sum_{d \in D} tf_{td}^2 - \frac{1}{|D|} \left[ \sum_{d \in D} tf_{td} \right]^2 \tag{10} $$

The $Q_0$ criterion is influenced by the dispersion over all documents in the corpus. The larger
the dispersion, the better the discriminative capability of the term. However, it performs
poorly on sparse data sets in which a large portion of low-frequency terms
occur only in documents related to a particular category. There is a high chance that these
low-frequency terms are missed when $Q_0$ is used to select the feature subspace. To remedy this,
Dhillon introduced another term quality, called $Q_1$, which is influenced by the
dispersion over the documents that contain the target term at least once. Formally, it is given by:

$$ Q_1(t) = |D'| \times Var(p_t, D') = \sum_{d \in D'} tf_{td}^2 - \frac{1}{|D'|} \left[ \sum_{d \in D'} tf_{td} \right]^2 \tag{11} $$


where $D'$ is the set of documents in which term $t$ occurs at least once. A major difference
between $Q_0$ and $Q_1$ is that $Q_0$ measures the target term over all
documents, whereas $Q_1$ considers only the documents in which the target term occurs.
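
The difference between the two criteria can be sketched in Python. The helper names and toy frequency vectors are hypothetical; the closed form used is the sum of squared frequencies minus the squared sum divided by the number of documents considered.

```python
def term_quality(tfs):
    """Sum of squared frequencies minus (sum squared) / number of documents."""
    n = len(tfs)
    if n == 0:
        return 0.0
    total = sum(tfs)
    return sum(f * f for f in tfs) - total * total / n


def q0(tf_per_doc):
    """Q0: dispersion over every document in the corpus, zeros included."""
    return term_quality(tf_per_doc)


def q1(tf_per_doc):
    """Q1: dispersion over only the documents that contain the term."""
    return term_quality([f for f in tf_per_doc if f > 0])


# Hypothetical term frequencies over a six-document corpus.
rare_even = [1, 1, 0, 0, 0, 0]   # rare term with uniform frequency
rare_peaky = [3, 1, 0, 0, 0, 0]  # rare term with uneven frequency
```

In this sketch Q0 is affected by the four documents that lack the term, while Q1 depends only on how the frequencies vary among the documents that contain it.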

To extend the term quality feature selection method, we define a new distribution
for the target term: the document frequency values over its word cluster. Here we
consider two variables, the target term $t$ and its word cluster $C_t$. Our knowledge of
the correlation between them is the document frequency $df_w$ of each word $w$ in the
word cluster in which term $t$ is included. The distribution of a term $t$ over its word cluster
$C_t$ is then given by $p_t(w) = df_w$; we refer to this distribution as the word distribution.

The variance of the distribution $p_t$ over the words in the cluster $C_t$ is given by:

$$ Var(p_t, C_t) = E[p_t^2] - E[p_t]^2 = \frac{\sum_{w \in C_t} df_w^2}{|C_t|} - \left[ \frac{1}{|C_t|} \sum_{w \in C_t} df_w \right]^2 \tag{12} $$

The word-cluster term quality criterion is then obtained by combining the document
distribution variance and the word distribution variance as follows:

$$ \mathrm{wc\_Q}(t) = Var(p_t, D) \times Var(p_t, C_t) \tag{13} $$

The combination of the document variance and the word variance provides additional
word-cluster knowledge and thus makes the new method less sensitive to term frequency.
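
A minimal Python sketch of Eq. (13) follows, assuming the word distribution is the list of document frequencies of the words in the term's cluster. The function names and the toy values are hypothetical.

```python
def variance(values):
    """Population variance E[x^2] - E[x]^2 of a non-empty list."""
    n = len(values)
    mean = sum(values) / n
    return sum(v * v for v in values) / n - mean * mean


def wc_q(tf_per_doc, df_in_cluster):
    """Eq. (13): document-distribution variance times word-distribution variance."""
    return variance(tf_per_doc) * variance(df_in_cluster)


tf_per_doc = [3, 1, 0, 0]    # tf of the term in each document of the corpus
df_in_cluster = [2, 4, 6]    # df of each word in the term's word cluster
score = wc_q(tf_per_doc, df_in_cluster)
```

Here the document variance is 1.5 and the word variance is 8/3, so the score is approximately 4.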


4  Experimental  Results 

To compare the document clustering performance of features selected by
the new methods with that of features selected by the original methods, we conduct a series of
comparison experiments on three public benchmark text data sets, i.e., 20 Newsgroups,
Reuters-21578 and CLASSIC3.

4.1 Data Sets

The CLASSIC3 data set [7] is available on the SMART system from Cornell's Web
site¹. It consists of 33,242 features and 3,896 document abstracts, and contains three
categories, i.e., MEDLINE, CISI, and CRANFIELD, from different specific domains.
MEDLINE consists of 1,033 abstracts from medical papers, CISI consists of 1,460
abstracts from information retrieval papers, and CRANFIELD consists of 1,400 abstracts
from aeronautical systems papers. The characteristics of these categories are very
different, and the overlap between the keywords of different categories is small. It is
thus not difficult to cluster the CLASSIC3 corpus.

Reuters-21578 is a text mining corpus that contains 21,578 news stories that appeared
on the Reuters newswire in 1987². We used the modified Apte ("ModApte") split, which
contains 9,603 training documents and 3,299 test documents. We discarded the
documents that have no labels; the remaining data set consists of 7,063 training documents
and 2,742 test documents. Furthermore, we generated our Reuters subset by selecting
the 10 largest categories, i.e., those with the most positive training documents [4], and
then discarding the documents that belong to more than one category. The resulting Reuters
subset contains 19,206 features for 5,973 documents. Note that the numbers of
documents in different categories differ greatly: the largest category contains 2,698
documents, whereas the smallest contains only 80. The purpose of
generating this data set is to evaluate the performance on a corpus with many
imbalanced classes.

The 20 Newsgroups corpus contains 19,997 documents from the Usenet newsgroups
collection³. In the experiment, we used a benchmark mini-subset of the 20 Newsgroups
corpus provided by the UCI machine learning archive [1]. The 20 Newsgroups subset
consists of 2,000 news documents and 26,620 features. It contains 20 categories, each
of which has 100 documents. Most documents are assigned to a single category, but
the categories in the corpus are semantically close, such as "comp.sys.ibm.pc.hardware"
and "comp.sys.mac.hardware", or "comp.graphics" and "comp.windows.x". The similarity
and overlap between different categories make it difficult to correctly group the
20 Newsgroups corpus into clusters. The purpose of choosing this data set is to evaluate
the performance on a corpus with many similar classes.

¹ CLASSIC3 can be found at: ftp://ftp.cs.cornell.edu/pub/smart
² Reuters-21578 can be found at: http://www.daviddlewis.com/resources/testcollections/reuters21578/
³ The 20 Newsgroups can be found at: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

These three corpora are frequently used as benchmark data sets in text
mining. They represent considerable diversity in number of classes, data size, data
imbalance and data similarity. We preprocess these corpora using the DRAGON toolkit
[14]. The details of the three corpora are listed in Table 1.

Table 1. Data sets used in the experiments

Data set     CLASSIC3                 Reuters                    20NG
# Variable   33,242                   19,206                     26,620
# Sample     3,896                    5,973                      2,000
# Class      3                        10                         20

Class        Name         # Sample    Name           # Sample    Name   # Sample
C1           MEDLINE      1,033       Earn           2,698       1      100
C2           CISI         1,456       Acq            1,471       2      100
C3           CRANFIELD    1,400       Money-fx       401         3      100
C4                                    Grain          334         4      100
C5                                    Crude          295         5      100
C6                                    Trade          292         6      100
C7                                    Interest       169         7      100
C8                                    Ship           134         8      100
C9                                    Money-supply   99          9      100
C10                                   Sugar          80          10     100
...                                                              ...    100
C20                                                              20     100

4.2  Evaluation  Measures 

We use two quality measures to evaluate the effectiveness of the selected features for
text clustering, i.e., Entropy (E) and F-Measure (F). The first measure, Entropy, provides
a measure of the purity of a cluster. A cluster that contains a large portion of objects from
different classes has a large entropy. The smaller the entropy, the better the performance.
The second measure, F-Measure, is a commonly used performance measure in information
retrieval. It combines the effects of precision and recall. The higher the F-Measure,
the better the clustering result.

For the entropy measure, we denote $C = \{C_1, C_2, \ldots, C_k\}$ as the obtained
clusters and $C^* = \{C^*_1, C^*_2, \ldots, C^*_{k'}\}$ as the correct classes, where $k$ and $k'$ are the
numbers of clusters and classes, respectively. Let $|C_i|$ be the number of documents in the $i$th obtained cluster and $|C^*_j|$
be the number of documents in the $j$th correct class. Given $C_i \in C$, the entropy of a
target cluster $C_i$ is defined to be:

$$ E_i = -\sum_{j=1}^{k'} p_{ij} \log p_{ij} \tag{14} $$


where $p_{ij}$ is the probability that a member of the target cluster $C_i$ belongs to the correct class $C^*_j$.
Formally, $p_{ij}$ is given by:

$$ p_{ij} = \frac{|C_i \cap C^*_j|}{|C_i|} \tag{15} $$

where $|C_i \cap C^*_j|$ is the number of documents of the class $C^*_j$ that are assigned to cluster
$C_i$, that is, the overlap between $C_i$ and $C^*_j$. Given $|C|$ as the total number of documents
in the corpus, the total entropy for the target clusters is computed as follows:

$$ E = \sum_{i=1}^{k} \frac{|C_i|}{|C|} E_i \tag{16} $$


Given $C_i \in C$ and $C^*_j \in C^*$, the precision, recall and F-Measure of the target cluster
$C_i$ with respect to class $C^*_j$ are defined as $P(i,j) = |C_i \cap C^*_j| / |C_i|$, $R(i,j) =
|C_i \cap C^*_j| / |C^*_j|$ and $F(i,j) = 2 P(i,j) \cdot R(i,j) / (P(i,j) + R(i,j))$.

For a particular target cluster, we choose the class that shares the most documents
with it as the correct class when evaluating its performance. That is, $F_i =
\max\{F(i,j) \mid j = 1, \ldots, k'\}$. The total F-Measure for all clusters is defined as follows:


(17) 
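
The two evaluation measures can be sketched in Python as follows. Clusters and classes are represented as sets of document ids. Two points are assumptions rather than statements from the text: the logarithm is natural, and, because the body of Eq. (17) is not legible here, the total F-Measure is weighted by cluster size by analogy with the total entropy of Eq. (16). All names and toy data are hypothetical.

```python
import math


def cluster_entropy(cluster, classes):
    """E_i = -sum_j p_ij log p_ij, with p_ij = |C_i & C*_j| / |C_i|."""
    entropy = 0.0
    for cls in classes:
        p = len(cluster & cls) / len(cluster)
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


def cluster_f(cluster, classes):
    """F_i: the best F(i, j) over all classes j."""
    best = 0.0
    for cls in classes:
        overlap = len(cluster & cls)
        if overlap:
            precision = overlap / len(cluster)
            recall = overlap / len(cls)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


def evaluate(clusters, classes):
    """Size-weighted totals; the F weighting is an assumption (see above)."""
    n = sum(len(c) for c in clusters)
    total_e = sum(len(c) / n * cluster_entropy(c, classes) for c in clusters)
    total_f = sum(len(c) / n * cluster_f(c, classes) for c in clusters)
    return total_e, total_f


classes = [{1, 2, 3}, {4, 5, 6}]
perfect = evaluate([{1, 2, 3}, {4, 5, 6}], classes)
mixed = evaluate([{1, 2, 4}, {3, 5, 6}], classes)
```

A clustering that reproduces the classes exactly gives zero entropy and an F-Measure of 1, while the mixed clustering gives a positive entropy and a lower F-Measure.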


4.3  Comparison  Experiments 

To compare the document clustering performance of feature subspaces selected
by the word-cluster methods with that of feature subspaces selected by the word-solely
methods, we conduct a series of comparison experiments on the CLASSIC3,
Reuters-21578, and 20 Newsgroups corpora. In these experiments, we used the single-linkage
algorithm and the similarity measure introduced in Section 2 to group words into
clusters, with the default similarity threshold set to 0.5. For document clustering,
we used the group-average agglomerative method.
The distance between two clusters is measured by the average cosine distance between
the documents of the two clusters.

We ran the document clustering algorithm on the three corpora. For each corpus,
the algorithm was executed with different percentages of features, from
2% to 40%. For each percentage, we ran the document clustering algorithm with
the features selected by each method, and computed the F-Measure (F) and Entropy (E)
results. The reported results are averaged over 3-fold cross validation. Fig. 1,
Fig. 2 and Fig. 3 show the F-Measure and Entropy comparison results. In the figures, the
left labels correspond to the F-Measure (F) scale, and the right labels correspond to the
Entropy (E) scale. For the F-Measure results, the solid triangles are the F-Measure
scores of the word-cluster methods, and the solid squares are the F-Measure scores
of the word-solely methods. For the entropy results, the hollow triangles refer to
the entropy scores of the word-cluster methods, and the hollow squares refer to the
entropy scores of the word-solely methods.

Fig. 1. Comparison on the CLASSIC3 corpus with different feature subspaces; (a), (b) and (c) are the
methods based on document frequency, term contribution and term quality, respectively

Figure 1 shows the F-Measure and Entropy results on the CLASSIC3
corpus with different feature subspaces selected by the word-cluster methods
and the word-solely methods. We can clearly see that the word-cluster methods
outperform the word-solely methods in both F-Measure and Entropy in all feature
subspaces, i.e., the word-cluster methods attain higher F-Measure and lower Entropy.
Specifically, for the document frequency methods shown in Fig. 1(a),
the word-cluster method significantly outperforms the word-solely method: the
F-Measure of the word-cluster method wc_DF is almost 20% larger than that
of the word-solely method DF. For the term contribution methods shown in Fig.
1(b), the performance of the word-cluster method and the word-solely
method is comparable, but the word-cluster method wc_TC is better on the 8% and
10% feature subspaces. Another observation is that the performance decreases
rapidly beyond the 20% feature subspace. For the term quality methods shown
in Fig. 1(c), the square plots are below the triangle plots for the F-Measure
comparison in all feature subspaces.

On the whole, the performance of the word-cluster methods and that of
the word-solely methods are comparable when the percentage of the feature subspace is
less than 6%. As the feature percentage increases, the word-cluster
methods are more stable: their curves are smooth,
while the curves of the word-solely methods are uneven.

Figure 2 shows the F-Measure and Entropy results on the Reuters-21578
corpus with different feature subspaces selected by the word-cluster methods
and the word-solely methods. The average performance on this corpus
is worse than that on the CLASSIC3 corpus, which may be
due to the imbalance of this corpus. However, we can
clearly see that the word-cluster methods outperform the word-solely methods on small
feature subspaces of less than 10%. This result indicates that the word-cluster
methods are especially effective when only a small feature subspace is selected. Another
observation is that the performance of the word-cluster methods is similar to that of
the word-solely methods on large feature subspaces. In fact, the word-cluster methods
slightly outperform on most of the percentages. When the selected feature subspace is
large, both the word-cluster methods and the word-solely methods are likely to select
a feature subspace in which no informative features are included, and thus the document
clustering performance is reduced.

Fig. 2. Comparison on the Reuters-21578 corpus with different feature subspaces; (a), (b) and (c) are the
methods based on document frequency, term contribution and term quality, respectively

Fig. 3. Comparison on the 20 Newsgroups corpus with different feature subspaces; (a), (b) and (c) are the
methods based on document frequency, term contribution and term quality, respectively

Figure 3 shows the plots of the F-Measure and Entropy results on the 20 Newsgroups
corpus with different feature subspaces separately selected by the word-cluster
methods and the word-solely methods. The overall performance on this corpus is
comparatively worse than that on the other two corpora, because the categories in
the 20 Newsgroups corpus are semantically similar and overlap with each other. However, the
improvement of the word-cluster methods is significant on this corpus. We can see that the
word-cluster methods clearly outperform the word-solely methods in almost all feature
subspaces. This result indicates that the word-cluster methods are especially effective for
complex corpora with many classes and high overlap.

In summary, the experimental results show that the word-cluster methods outperform
the word-solely methods in most feature subspaces. These results indicate that the
word-cluster methods can select more discriminative features, and thus the document
clustering performance is improved.


5  Conclusions 

In this paper, we define a clustering algorithm and a similarity measure to group words
in the corpus into clusters, and blend the word cluster information with the bag-of-words
information to select a feature subspace for document clustering. In this way, we extend
three well-known unsupervised selection methods and propose three new methods. We
have conducted a series of comparison experiments on three benchmark corpora, and the
results show that the document clustering performance on feature subspaces selected by
the word-cluster methods is better than on those selected by the word-solely methods.
Abstract. Unlike traditional classification tasks, multilabel classification allows
a sample to be associated with more than one label. This generalization naturally
increases the difficulty of classification. Similar to the single-label classification task,
neighborhood-based algorithms relying on the nearest neighbor have attracted
much attention and some of them show positive results. In this paper, we propose
an Adaptive Neighborhood algorithm for multilabel classification. Constructing
an adaptive neighborhood is challenging because specific information about the
neighborhood, e.g. the similarity measurement, should be determined automatically
during construction rather than provided by the user beforehand. Little literature
has covered this topic, and we address this difficulty by solving an optimization
problem based on the theory of sparse representation. Taking advantage of the
extracted adaptive neighborhood, classification can be readily done using a weighted
sum of the labels of the training data. Extensive experiments show that our proposed method
outperforms the state of the art.


1  Introduction 

Multilabel classification has been a popular issue in pattern recognition and machine
learning and is encountered in a variety of application domains. For instance, in biology,
a gene or protein may possess several functionalities, and in natural scene classification, a
picture of a beach may also include boats, trees and even a city as its contents. Behind
these appearances lies the fact that one object is allowed to be associated with more than one
label. Solving classification tasks in the multilabel scenario is naturally a generalization
of the traditional task and possesses much more practical value as well as more difficulties.

Several methods taking advantage of traditional classification algorithms, e.g. AdaBoost,
SVM and EM, have been proposed to solve this problem. Recent research [1,2]
shows that neighborhood-based algorithms relying on the nearest neighbor can achieve
good results in the multilabel classification task, just as in the single-label case.
However, the way of choosing the neighborhood in these works is based on K Nearest
Neighbor (KNN), in which several parameters should be given in advance, such as the similarity
measurement and the size of the neighborhood $K$. Constructing an adaptive
neighborhood that can get rid of these specifications would be helpful but challenging. In this
paper, we address the difficulty by extracting this adaptive neighborhood with an
optimization problem based on the theory of sparse representation and further use it for
multilabel classification. To the best of our knowledge, there is no similar work
using this technique to handle the multilabel classification problem.

B.-T.  Zhang  and  M.A.Orgun  (Eds.):  PRICAI  2010,  LNAI  6230,  pp.  304-314,  2010. 
The rest of the paper is organized as follows. In the next section we give a brief
review of previous work on multilabel classification and sparse representation.
Then we present our Adaptive Neighborhood (AN) algorithm and report the experimental
results. Finally we conclude this paper and point out some promising directions for
future work.

2  Related  Work 

2.1 Multilabel Classification

Multilabel classification began to attract wide attention due to the work of Schapire and
Singer [3]. They presented a boosting-based system, BoosTexter, for text categorization
and also provided several useful measurements that can be extended to other multilabel
classification tasks. Besides, they pointed out that controlling the complexity of the
overall learning system is an important research issue. To control this complexity
while keeping a small empirical error, Elisseeff and Weston proposed the RankSVM [4]
method. As in the Support Vector Machine (SVM), a linear model is defined so as to
minimize the empirical error measured by the ranking loss and control the complexity of
the resulting model simultaneously.

Zhang and Zhou introduced a lazy approach to multilabel classification named ML-KNN [1].
In their algorithm, the K nearest neighbors in the training set are first computed for an unseen
sample, then a Maximum A Posteriori (MAP) method is used to perform the classification,
based on the statistical information gained from the label sets of the neighboring instances.
Motivated by this lazy method, Cheng and Hüllermeier proposed the IBLR-ML algorithm [2],
which combines instance-based learning and logistic regression and allows one to
capture the interdependencies between the class labels in a proper way. Experiments on
public data sets show that, among several existing multilabel classification algorithms,
both ML-KNN and IBLR-ML achieve state-of-the-art classification performance.
However, both of these methods are based on
KNN, which can easily run into the difficulty of choosing a suitable similarity measurement
and the size of the neighborhood.

2.2  Sparse  Representation 

The theory of sparse representation is closely related to our work. It has been quite popular
in the machine learning area, including face recognition [5], dimensionality reduction [6],
image super-resolution [7] and image denoising [8]. The sparse solution of underdetermined
systems of linear equations lies at the heart of this theory. As stated in [9], finding such
a solution can be formulated as the following optimization problem ($P_0$):

$$\min_{w} \|w\|_0 \quad \text{s.t.} \quad z = Xw \tag{1}$$
Unfortunately, although the $\ell_0$-norm is a straightforward measurement of sparsity, problem
$P_0$ has been proved to be NP-hard [10]. To overcome this prohibitive computation issue,
a compromise is to deal with $P_1$ instead:

$$\min_{w} \|w\|_1 \quad \text{s.t.} \quad z = Xw \tag{2}$$

which is a convex optimization problem and can be readily solved by linear programming [11].
$P_1$ is the central focus of sparse representation and has been shown to have exactly the
same solution as $P_0$ when the solution is very sparse [9].
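
As a brief worked note on why $P_1$ can be handed to a linear programming solver, the standard variable-splitting argument can be written out as follows. The auxiliary vectors $u$ and $v$ are introduced here purely for illustration and are not part of the original formulation.

```latex
\begin{aligned}
&\text{Write } w = u - v \text{ with } u \ge 0,\; v \ge 0. \\
&\min_{u,\,v}\; \mathbf{1}^{\top}(u + v) \\
&\text{s.t.}\quad X(u - v) = z,\qquad u \ge 0,\quad v \ge 0 .
\end{aligned}
```

At an optimum, at most one of $u_j$ and $v_j$ is nonzero for each index $j$ (otherwise both could be reduced by the same amount without violating the constraint), so $\mathbf{1}^{\top}(u + v) = \|w\|_1$ and the linear program attains the same optimal value as $P_1$.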

Sparse representation has been used in many classification tasks, one of which
is the work of Wright et al. [5] on robust face recognition. According to their paper,
samples from the same class are modeled as lying on a linear subspace. Given sufficient
training samples of the $i$th class, $X_i = [x_{i,1}, \cdots, x_{i,n_i}] \in \mathbb{R}^{d \times n_i}$, any test sample
$z \in \mathbb{R}^d$ from the same class can be approximately written as a linear
combination of the training samples associated with the $i$th class:

$$z = w_1 x_{i,1} + \cdots + w_{n_i} x_{i,n_i} = X_i w$$

Following the idea above, for any unseen sample, finding a sparse representation over all
the training samples would typically yield a solution whose nonzero entries are
associated with training examples of the same class, as shown in the following result, from
which we can see that sparse representation is able to capture the discriminant nature
behind the samples:

$$z = Xw = [X_1, \cdots, X_k]\,[0, \cdots, 0, w_1, \cdots, w_{n_i}, 0, \cdots, 0]^{T}$$

The sparse representation can be obtained by solving $P_1$. In realistic tasks, the exact
representation of a test sample may not be achievable due to noise. Usually a stable
version is considered instead:

$$\min_{w} \|w\|_1 \quad \text{s.t.} \quad \|z - Xw\|_2 \le \epsilon \tag{3}$$

where $\epsilon$ is an error tolerance. This is a convex program and can be efficiently
solved. With the obtained representation, the prediction for a test sample can be made
by choosing the class with the least residual. The algorithm achieves positive results on
several public data sets with high accuracy and robustness to occlusion.
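
The least-residual decision rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the coefficient vector `w` is taken as already computed by some solver, training samples are stored as a list of columns, and the function name `least_residual_class` is hypothetical rather than taken from [5].

```python
import math


def least_residual_class(columns, classes, w, z):
    """Return the class whose own coefficients reconstruct z with least residual.

    columns[j] is the j-th training sample, classes[j] its class label and
    w[j] its coefficient in the sparse representation of z.
    """
    best_class, best_residual = None, math.inf
    for c in sorted(set(classes)):
        # Reconstruct z using only the coefficients that belong to class c.
        recon = [0.0] * len(z)
        for col, cls, wj in zip(columns, classes, w):
            if cls == c:
                recon = [r + wj * v for r, v in zip(recon, col)]
        residual = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, recon)))
        if residual < best_residual:
            best_class, best_residual = c, residual
    return best_class, best_residual
```

For instance, with two training samples `[1, 0]` (class "A") and `[0, 1]` (class "B"), coefficients `[1.5, 0]` and test sample `[2, 0]`, the class "A" reconstruction leaves a residual of 0.5 while class "B" leaves 2.0, so "A" is chosen.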


3 Adaptive Neighbor (AN) Algorithm

3.1  Problem  Setting 

Consider the following multilabel classification problem with:

training set: $Tr = \{(x_i, Y_i)\}_{i=1}^{n}$, ($x_i \in \mathbb{R}^d$, $Y_i \subseteq L$)
test set: $Te = \{z_i\}_{i=1}^{m}$, ($z_i \in \mathbb{R}^d$)

Our goal is to learn a classifier:

$$f : \mathbb{R}^d \times L \rightarrow \mathbb{R}$$

which tends to assign a higher value to $(z, y_l)$ if $y_l$ belongs to $Y_z$. From $f$ we can
easily predict the labels of an unseen sample, e.g. $predict(z, y_l) = [\![ f(z, y_l) > \theta ]\!]$, where $\theta$ is
a threshold. Another statistic we would like to obtain is the rank information between
different labels: the function $rank_f(z, y_l)$ ranks different labels according to the
corresponding value of $f(z, y_l)$, where a higher value of $f$ gets a lower (better) rank position.
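
To make the two derived functions concrete, here is a minimal sketch of thresholding and ranking for a single sample. It assumes the scores $f(z, y_l)$ are already available as a plain list indexed by label; the function names and the tie-breaking rule (lower label index first) are illustrative assumptions, not part of the original definition.

```python
def predict(scores, theta):
    """Indicator [[f(z, y_l) > theta]] for every label l."""
    return [1 if s > theta else 0 for s in scores]


def rank_positions(scores):
    """Rank 1 goes to the highest score; ties are broken by label index."""
    order = sorted(range(len(scores)), key=lambda l: -scores[l])
    ranks = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        ranks[label] = position
    return ranks
```

With scores `[0.2, 0.9, 0.5]` and a threshold of 0.4, the second and third labels are predicted, and the rank positions are 3, 1 and 2 respectively.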


3.2  Our  Method 


Extensively applied in different machine learning tasks, ranging from single-label
classification to dimensionality reduction [12,13] and multilabel classification [1], KNN
usually serves as an intermediate step to seek the connections between samples.
However, the neighborhood information gained from KNN, largely based on the choice of
similarity measurement and the size of the neighborhood, presents a simple but limited
portrait of the correlations between samples.

In order to capture the discriminant nature behind the data, our work focuses on
designing an effective construction of an adaptive neighborhood on which the multilabel
classification task can be efficiently carried out. By adaptive, we mean that this neighborhood
is determined by the natural structure behind the data, and we do not have to prescribe
parameters such as the number of neighbors $K$ or a specific similarity
measurement. Motivated by sparse representation in face recognition [5], we summarize this
procedure in a similar optimization problem ($P_{AN}$):

$$\min_{w} \|z - Xw\|_2^2 + \lambda\|w\|_1 \quad \text{s.t.} \quad w \ge 0 \tag{4}$$

$X$ is a $d$ by $n$ matrix whose columns contain the training data of dimension $d$, and $z$ is a
single test sample. Our goal is to seek the sparsest coefficient vector $w$ while keeping the
residual as small as possible. This formulation is able to capture exactly the same kind
of discriminant nature as the sparse representation described in the previous section. However,
our method still differs from sparse representation in the objective function and the
constraint as follows:

- Different from sparse representation, which aims at finding a sparse solution with
the best reconstruction result, our method is more concerned with finding the information
of the neighborhood, for which nonnegativity is necessary.

- The nonnegativity constraint provides a straightforward interpretation of the
relation between the test sample and the training samples, where a larger value of $w_i$
means that the $i$th training sample is "more similar" to the test sample $z$, and vice
versa.
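
One simple way to solve a problem of the form (4) is cyclic coordinate descent, sketched below in plain Python. This is an illustrative solver chosen for brevity, not the implementation used in the experiments; the function name `nonneg_lasso`, the sweep limit and the tolerance are assumptions. For a single coordinate the objective is a one-dimensional quadratic, so the update has the closed form $w_j = \max\bigl(0, (x_j^{T} r_j - \lambda/2) / \|x_j\|_2^2\bigr)$, where $r_j$ is the residual computed without column $j$.

```python
def nonneg_lasso(columns, z, lam=1.0, sweeps=200, tol=1e-10):
    """Coordinate descent for min_w ||z - Xw||_2^2 + lam * ||w||_1  s.t. w >= 0.

    `columns` is a list of training samples (the columns of X), each a list
    of d floats; `z` is the test sample.
    """
    w = [0.0] * len(columns)
    residual = list(z)  # z - Xw, with w = 0 initially
    sq_norms = [sum(v * v for v in col) for col in columns]
    for _ in range(sweeps):
        largest_change = 0.0
        for j, col in enumerate(columns):
            if sq_norms[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(c * r for c, r in zip(col, residual)) + sq_norms[j] * w[j]
            new_wj = max(0.0, (rho - lam / 2.0) / sq_norms[j])
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_wj
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return w
```

The indices with strictly positive weight form the neighborhood of `z`. For example, with training samples `[1, 0]` and `[0, 1]`, test sample `[2, 0]` and `lam=1.0`, only the first sample receives a positive weight (1.5), so the neighborhood has size one without any size being prescribed.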

Based on the facts above, we claim that an adaptive neighborhood for each test
sample is obtained, noting that we do not need to prescribe any concrete
similarity measurement between samples or the size of the neighborhood. Unlike sparse
representation, which chooses the class with the least residual in classification [5], we design the
classifier as a simpler weighted sum: for a label $l \in L$ and a given test sample $z$,
$f(z, y_l) = \sum_j w_j Y_{lj}$, where $Y$ contains the true labels of the training data, one sample per column.
Algorithm 1 shows the complete description.
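
The weighted-sum step can be sketched as follows, taking the weight vector $w$ as given. Two points are assumptions made for illustration: the labels are encoded as 0/1 indicators, and "normalize" is read as scaling $w$ to sum to one (the text does not specify the normalization). The function name `an_classify` is hypothetical.

```python
def an_classify(weights, label_columns, theta):
    """Score every label by f(z, .) = Y w and threshold the result.

    weights[j] is the nonnegative coefficient of training sample j and
    label_columns[j] is its 0/1 label vector (column j of Y).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("empty neighborhood: all weights are zero")
    normalised = [wj / total for wj in weights]  # assumed: weights sum to 1
    num_labels = len(label_columns[0])
    scores = [
        sum(wj * col[l] for wj, col in zip(normalised, label_columns))
        for l in range(num_labels)
    ]
    decisions = [1 if s > theta else -1 for s in scores]
    return scores, decisions
```

With weights `[1.5, 0.0, 0.5]`, label columns `[1, 0]`, `[0, 1]`, `[1, 1]` and a threshold of 0.5, the scores are 1.0 and 0.25, so only the first label is predicted.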
3.3 Comparison with Previous Work

Compared to previous state-of-the-art works such as ML-KNN and IBLR-ML, several
notable differences should be emphasized for our method, which make multilabel
classification effective and efficient.

First, the neighbors chosen by our algorithm are generally different from those of KNN.
Inherited from sparse representation, AN tends to select those neighbors that share the
same underlying subspace, as can be seen from Figure 1.


Fig. 1. Data from two affine subspaces (y = 5.0, y = 6.0) with Gaussian noise added. The solid line
shows the neighbors selected by AN and the dashed line gives those of 2-Nearest Neighbors (2NN).
2NN selects the neighbors with the least distances, while the neighbors chosen by AN automatically (with
a size of two, coincidentally) tend to lie on the same subspace and are much more
discriminative [5].


Second, we do not need to prescribe the size of the neighborhood $K$ as in ML-KNN
and IBLR-ML. $K$ is set to the number of nonzero elements in $w$, which is
naturally obtained from the above optimization. Although we can still fix the value of $K$
by choosing the $K$ largest elements of $w$, it is advisable to let different samples
belong to different neighborhoods of different sizes.

In addition, due to the natural discriminant property of sparsity, no further
complicated classifier is required; a simple weighted sum suffices. This makes the
classification procedure more efficient.

4  Experiments 

In this section, experiments are conducted on public multilabel classification data sets,
which serve both to demonstrate the efficacy of the proposed method and to validate the
claims we have made in the previous sections. We compare our results with the state of
the art, including ML-KNN and IBLR-ML, whose implementations are provided
by their original authors. Our algorithm can be efficiently implemented using the sparse
learning package l1_ls¹ or SLEP [14].

Algorithm 1. Adaptive Neighborhood

Input:
    $X$: training data
    $Y$: training label set
    $z$: test sample
    $\theta$: threshold, $\lambda$: regularizer
Output:
    $f$: classifier
    $predict$: predicting function
Procedure:
    for all test samples $z$ do
        Solve the optimization problem
            $\min_{w} \|z - Xw\|_2^2 + \lambda\|w\|_1 \quad \text{s.t.} \quad w \ge 0$
        Normalize $w$
        $f(z, \cdot) = Yw$
        for $j = 1$ to $|L|$ do
            if $f(z, y_j) > \theta$ then
                $predict(z, y_j) = 1$
            else
                $predict(z, y_j) = -1$
            end if
        end for
    end for

4.1  Measurement 

Unlike the traditional loss function of single-label classification, special criteria should
be considered when evaluating the performance of a multilabel task. Here we use the
measurements provided in [3].

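- Hamming Loss:

$$HLoss(f, X, Y) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|L|}\,\bigl|predict(x_i) \oplus Y_i\bigr|$$

- One Error:

$$OneError(f, X, Y) = \frac{1}{n}\sum_{i=1}^{n}\bigl[\!\bigl[\arg\max_{y \in L} f(x_i, y) \notin Y_i\bigr]\!\bigr]$$

- Coverage:

$$Coverage(f, X, Y) = \frac{1}{n}\sum_{i=1}^{n}\max_{y \in Y_i} rank_f(x_i, y) - 1$$

- Ranking Loss:

$$rloss(f, X, Y) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|Y_i|\,|\bar{Y}_i|}\,\bigl|\{(y_1, y_2) \mid f(x_i, y_1) \le f(x_i, y_2),\ (y_1, y_2) \in Y_i \times \bar{Y}_i\}\bigr|$$

- Average Precision:

$$AvePrec(f, X, Y) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|Y_i|}\sum_{y \in Y_i}\frac{\bigl|\{y' \mid rank_f(x_i, y') \le rank_f(x_i, y),\ y' \in Y_i\}\bigr|}{rank_f(x_i, y)}$$

1 http://www.stanford.edu/~boyd/l1_ls/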

The operator $\oplus$ in Hamming Loss denotes the symmetric difference, which measures the
number of labels that we have misclassified during the test phase. In One Error, the function
$[\![\cdot]\!]$ takes the value 1 if its argument holds true, and the whole statistic counts how often
the label we classified with the most confidence is actually incorrect. Coverage measures
how far we need to go down the label list to cover all the positive labels, and Ranking
Loss gives the average fraction of label pairs that are not correctly ordered. Similar to
the concept of precision in Information Retrieval, Average Precision gives the mean
precision on every label.
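
The five measures can be computed directly from a score matrix, as in the following sketch. It follows the verbal descriptions above and the usual definitions from [3]; the function names, the dictionary keys, the tie-breaking rule and the assumption that every true label set is non-empty and smaller than the full label set are choices made for this illustration.

```python
def rank_positions(scores):
    """Rank 1 goes to the highest score; ties are broken by label index."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        ranks[label] = position
    return ranks


def evaluate(score_rows, true_sets, theta=0.5):
    """Average the five measures over all samples.

    score_rows[i][j] is f(x_i, y_j); true_sets[i] is the set of relevant
    label indices Y_i (assumed non-empty and not the full label set).
    """
    totals = {"hamming_loss": 0.0, "one_error": 0.0, "coverage": 0.0,
              "ranking_loss": 0.0, "average_precision": 0.0}
    for scores, truth in zip(score_rows, true_sets):
        q = len(scores)
        ranks = rank_positions(scores)
        predicted = {j for j, s in enumerate(scores) if s > theta}
        irrelevant = [j for j in range(q) if j not in truth]

        # Symmetric difference between predicted and true label sets.
        totals["hamming_loss"] += len(predicted ^ truth) / q
        # Is the top-ranked label outside the true label set?
        totals["one_error"] += 0.0 if ranks.index(1) in truth else 1.0
        # Depth needed in the ranking to cover every true label.
        totals["coverage"] += max(ranks[j] for j in truth) - 1
        # Fraction of (relevant, irrelevant) pairs ordered incorrectly.
        misordered = sum(1 for a in truth for b in irrelevant
                         if scores[a] <= scores[b])
        totals["ranking_loss"] += misordered / (len(truth) * len(irrelevant))
        # Precision at the rank of each true label, averaged over true labels.
        totals["average_precision"] += sum(
            sum(1 for k in truth if ranks[k] <= ranks[j]) / ranks[j]
            for j in truth
        ) / len(truth)
    n = len(score_rows)
    return {name: total / n for name, total in totals.items()}
```

Calling `evaluate([[0.9, 0.1, 0.6], [0.2, 0.7, 0.4]], [{0, 2}, {0}])` scores one perfectly ranked sample and one sample whose only true label is ranked last, which gives a One Error of 0.5 and a Coverage of 1.5.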


Table 1. Statistics of the data sets used in the experiments

Data Set   Instances   Attributes   Labels
genbase    662         1186         27
medical    978         1449         45
enron      1702        1001         53
bibtex     7358        1836         159


4.2  Data  Sets 

Four data sets², genbase, medical, enron and bibtex, are chosen for our experiments.
Data set genbase is derived from the task of protein classification [15], where each
protein can be associated with at most 27 labels. The data set contains 662 instances of 1185
dimensions. The medical data set contains 978 instances of dimension 1449, each with
up to 45 labels. It comes from the international challenge of classifying clinical free
text using natural language processing, which aims to create and train computational
intelligence algorithms that automate the assignment of ICD-9-CM codes to clinical
free text. Data set enron is derived from the UC Berkeley Enron Email Analysis Project
and contains email data from about 150 users, mostly senior management of Enron.
After processing, the current data set comprises 1702 instances of dimension
1001, and 53 labels are involved. The last data set we use is relatively large: bibtex was
used to solve the automated tag suggestion problem [16] and contains 7395 instances of
dimension 1836 with 159 labels. An overview of all the data is provided in Table 1.

2 http://mlkd.csd.auth.gr/multilabel.html

4.3  Parameter  Setting 

As pointed out in the previous section, K nearest neighbors are involved in both
ML-KNN and IBLR-ML. In their experiments, the size of the neighborhood is fixed
at 10, with which positive results have been achieved. We also use this value in our
experiments for fairness. The regularizer $\lambda$ in algorithm AN should also be carefully chosen.
Although various methods have been proposed to deal with this issue, there is currently
no reliable way to get the optimal value. Cross validation can be adopted for better
performance; however, that would be time-consuming. Therefore we simply fix $\lambda$ at 1.0 in
all our experiments. It will be shown in our experiments that a small change in
$\lambda$ does not affect the performance much.


Fig. 2. Index values of AN vs. $\lambda$ on genbase: a small $\lambda$ tends to give better performance,
which decreases as $\lambda$ increases; however, a small change around its manually chosen value (e.g. 1.0
here) does not affect the efficacy much


4.4  Experimental  Results  and  Analysis 

First, we test the stability of our AN algorithm with respect to the setting of $\lambda$ by assigning
different values of $\lambda$ in a relatively large range on the genbase data set, as shown
in Figure 2. We can see that a small value of $\lambda$ tends to give better performance.
This can be explained as follows: as $\lambda$ increases, the optimization exerts more penalty on
the sparsity of $w$. A very large $\lambda$ would typically result in very few neighbors (e.g.
only 1) being chosen for further classification, which yields bad
classification results. However, it can also be recognized that, for a small value of $\lambda$, a
small change does not affect the performance much. Second, we compare our AN
algorithm with ML-KNN and IBLR-ML on the aforementioned measurements. The
↓ beside each measurement means that a smaller value yields better performance, while
↑ represents the opposite. Table 2 shows the testing results on genbase, from which we
can see that the AN algorithm dramatically outperforms the other methods in all statistics
except for Ranking Loss, on which IBLR-ML achieves the best result.

Table 2. Comparative results on genbase: AN achieves the best performance on all statistics
except for Ranking Loss. IBLR-ML gets better performance than ML-KNN in all statistics.

Algorithm     AN        ML-KNN    IBLR-ML
HLoss ↓       0.0020    0.0050    0.0020
OneError ↓    0.0056    0.0090    0.0070
Coverage ↓    0.3518    0.5610    0.4220
RLoss ↓       0.0058    0.0060    0.0040
AvePrec ↑     0.9920    0.9890    0.9900

Table 3. Comparative results on medical: AN has the best result except for Coverage, on which
ML-KNN gets the best performance. On this data set, IBLR-ML was surpassed by ML-KNN in all
statistics.

Algorithm     AN        ML-KNN    IBLR-ML
HLoss ↓       0.0165    0.0171    0.0223
OneError ↓    0.1381    0.2643    0.3844
Coverage ↓    1.7177    0.7237    4.7960
RLoss ↓       0.0253    0.0425    0.0833
AvePrec ↑     0.8876    0.7957    0.7045

Table 4. Comparative results on enron: except for Hamming Loss, AN achieves the best
performance. ML-KNN outperforms IBLR-ML consistently in all statistics.

Algorithm     AN        ML-KNN    IBLR-ML
HLoss ↓       0.0540    0.0520    0.0572
OneError ↓    0.3005    0.3040    0.3834
Coverage ↓    12.8532   13.2055   14.9551
RLoss ↓       0.0891    0.0938    0.1124
AvePrec ↑     0.6598    0.6232    0.6020

Similarly, Table 3 to Table 5 give the effectiveness of the three algorithms on the data sets
medical, enron and bibtex respectively. From the experimental results we can see that
IBLR-ML outperforms ML-KNN on the genbase data set while the opposite results are achieved
on the medical and enron data sets, so neither is guaranteed to be better than the other. However,
although AN does not possess the best results in all statistics, it can still be recognized
that AN dominates the experimental results and outperforms the other two.
Table 5. Comparative results on bibtex: AN leads in all statistics, and significant improvement
is achieved in Ranking Loss and Coverage

Algorithm     AN        ML-KNN    IBLR-ML
HLoss ↓       0.0137    0.0140    0.0189
OneError ↓    0.4064    0.5853    0.6294
Coverage ↓    26.6282   56.2179   48.7797
RLoss ↓       0.0896    0.2173    0.1961
AvePrec ↑     0.5378    0.3449    0.3349


5  Conclusion  and  Future  Work 

In this paper, we propose an Adaptive Neighborhood algorithm for multilabel
classification. We construct an adaptive neighborhood by an optimization procedure similar to
sparse representation but with a more interpretable relation between neighbors.
Based on this automatically formed neighborhood, classification can be easily carried
out. Experiments show that our algorithm outperforms the state of the art.

Some issues of this framework still need to be improved, as listed in the following points,
which will be our future work:

- The quadratic programming behind the algorithm is time consuming. Solving the
optimization more efficiently would be helpful.

- How to take the correlations between labels into account explicitly under the AN framework
is another issue.

- Exploring ways of performing classification under AN other than our current weighted
sum method is desirable.
Abstract. Behavior analysis has received much attention in recent years in areas such as
customer-relationship management, social security surveillance and e-business.
Discovering high impact-driven behavior patterns is important for detecting and
preventing their occurrences and reducing the resulting risks and losses to our
society. In the data mining community, researchers pay little attention to time-stamps
in temporal behavior sequences (without explicitly considering inherent
temporal information) during classification. In this paper, we propose a novel Temporal
Feature Extraction Method (TFEM). It extracts sequential pattern features where
each transition is annotated with a typical transition time (its duration or interval).
It therefore substantially enriches the temporal characteristics derived from temporal
sequences, yielding improvements in performance, as demonstrated by a set of
experiments performed on synthetic and real-world datasets. In addition, TFEM
has the merit of simplicity in implementation, and its pattern-based architecture
can generate human-readable results and supply clear interpretability to users.
Meanwhile, it is adjustable and adaptive to different user configurations,
allowing a tradeoff between classification accuracy and time cost.


1  Introduction 

Behavior analysis [1,2] is increasingly regarded as a key component in business
problem solving. Unlike traditional analytical methods, behavior informatics is aimed
at discovering high impact events (i.e. those activities associated with or causing a
specific impact of interest to the business world) from behavioral data. Discovering high
impact-driven behavior patterns is important for detecting and preventing their
occurrences and reducing the resulting risks and losses to our society, in areas such as earthquake
prediction, epidemic outbreak monitoring, market surveillance, fraud detection and national
security. In order to identify high impact behavior patterns, the usual transactional data
needs to be converted into behavioral data, which is organized to explicitly present
properties associated with behavior and its impact on business.

A typical way of recording behavior is to construct sequences of
behavior, generating so-called sequential data. Sequential data is widely seen in many
applications, including business applications and scientific applications. In general,
sequential data only involves the ordering relationship existing in behavior sequences.
Table 1. An example dataset of sequences with timestamps

ID    t1      t2    t3      t4    t5    t6      ...   label
s1    a       c     (bd)    c     b     (ac)          c1
s2    b       a     a       a     a     b             c2
s3    c       a     a       a     a     (ac)          c2
s4    a       a     c       c     b     c             c1
s5    (abc)   a     b       d     e     d             c1


A sequence $s_i$ collects a list of ordered objects $e_n$, $s_i = \{e_1, e_2, \ldots, e_n\}$, in which
$e_n = (x_1 x_2 \cdots x_q)$ is an element consisting of activities, events or actions in the
behavior sequence, and $x_q$ records the properties or items associated with the sequence
itemset. When timestamps $(t_1, \ldots, t_n)$ are added to their corresponding behavior actions
$(e_1, e_2, \ldots, e_n)$, we generate temporal sequences. A temporal sequence is expressed as
$s_i = \{(t_1, e_1), (t_2, e_2), \ldots, (t_n, e_n)\}$ where $t_{n-1} < t_n$. In the real world, a sequence
of behavior often has a certain impact on business; for instance, a series of abnormal
online payments may constitute online payment fraud, and a list of high-risk terrorist activities may
lead to an eventual disaster for society. Let $C = \{c_1, c_2, \ldots, c_m\}$ represent such
business impacts, where $c_m$ is a specific class of impact, for instance, high-risk customers. Table
1 shows an example of five sequences; each sequence consists of a list of actions
happening at different time points. At some time points, multiple actions co-occur, such as
$(t_1, s_5) = (abc)$. Each sequence is associated with a business impact label; for instance,
$s_5$ has the associated label $c_1$. In practice, quantitative temporal information associated with
activities is helpful for distinguishing high impact behavior from others. We call such
activities time sensitive. Time-sensitive behavior is widely seen in many applications.
For instance,

- Example 1. In medical diagnosis and symptom analysis, temporal information
is crucial for doctors to accurately diagnose diseases. For instance, H1N1
influenza (swine flu) has a rapid onset within 3-6 hours, presenting with high
fever (greater than 102 °F). In contrast, such sudden fever is rare with a
common cold. This example shows the importance of considering temporal
intervals in sequence analysis.

- Example 2. In failure detection and identification for assembly line
systems, anomalies can be detected with the help of the quantitative temporal
intervals between tasks. For example, suppose there are three successive
workflow tasks, and it normally takes 8 minutes from task 1 to task 2 and 2
minutes from task 2 to task 3. If a record shows 2 minutes from task 1 to task
2 and 6 minutes from task 2 to task 3, this may indicate the presence of an
anomaly even though the sequence representation of those tasks presents nothing
abnormal. This example shows that sequence analysis without considering
temporal intervals may miss important findings.

- Example 3. In web usage analysis, if many users tend to stay longer on some
particular websites than on others, the difference in browsing duration
indicates that the long-stay websites are more attractive. This example shows
the importance of considering user navigation duration in web usage analysis.
To analyze patterns in the dataset in Table 1 and in the above applications,
traditional sequence analysis methods only count the ordering information among
sequential items, and treat all actions equally by merging them together. For
instance, a health insurance member who claims one or multiple service types at
the same time with increasing frequency may indicate either a worsening health
situation or fraudulent claims. Health insurance providers may be interested in
claim review and active customer care, so as to work out why multiple services
were conducted at the same time, whether there is any service of particular
interest to the patient, why the patient frequently visited doctors, or whether
the patient saw different doctors. While these questions are critical for
health insurance providers, it is hard for the existing sequence analysis
approaches to find informative hints for them.

This is because the existing sequence analysis approaches mainly focus on
sequence items and their ordering relationship. Consequently, important
information in temporal sequences is missing, for instance, the time interval
between two consecutive activities, the activities co-occurring at the same
time, and the impact label associated with a sequence. However, these aspects
are critical for disclosing in-depth causes and effects associated with
discriminative behavior. For this, both temporal sequence analysis and temporal
sequence classification can play an important role. Temporal sequence analysis
is an emerging research issue in sequence analysis. Limited research has been
conducted on mining sequential patterns from temporal sequential data. To the
best of our knowledge, current approaches mainly pay attention to the
timestamps associated with events, which are converted into sequential orders
of the underlying activities.

In  addition,  while  sequence  classification  is  attracting  more  and  more  interest  [3], 
people  focus  on  the  combination  of  classification  with  traditional  sequential  pattern 
mining. The goal of sequence classification is to predict which class a given
sequence belongs to. No substantial work has been found on classifying temporal
sequences.

Unfortunately, how to handle time sensitivity in temporal sequence
classification is a difficult problem. The construction of the sequence of
items should be intertwined with the construction of its timestamps.
Historically, researchers have focused independently on either sequential or
temporal aspects. How to combine temporal information with sequence
classification to attain an enhanced informative model is nearly unexplored. In
addition, it is very time consuming to identify patterns combining temporal
information with sequence classification.

In this paper, we discuss temporal sequence classification. The main idea is to
incorporate temporal information into sequence classification. For this, we
propose the Temporal Feature Extraction Method (TFEM) to mine temporal features
for sequence classification. Our contribution is two-fold.

- First, we design innovative feature mining algorithms which can effectively
represent temporal information for sequence classification. The time-sensitive
features enrich temporal characteristics derived from the raw data, yielding an
improvement in sequence classification performance, as demonstrated by a set of
experiments performed on synthetic and real-world datasets.

- Second, our result is easily interpretable. We employ a decision tree to
generate human-friendly rules. Additionally, it provides an adaptive solution
allowing the user to determine a tradeoff between classification accuracy and
computational cost.
The rest of the paper is organized as follows. Section 2 summarizes related
work. Section 3 introduces our novel TFEM approach of mining time-sensitive
features for sequence classification. Section 4 presents two empirical studies
in which we applied our method to synthetic and real-world datasets. Section 5
discusses an extension of our TFEM approach. Finally, we conclude our work in
Section 6.


2  Related  Work 

Temporal sequence mining has been explored intensively. Based on the nature of
items, sequences can be divided into two categories: symbolic representation
(discrete variables, e.g. an action code, or tick-by-tick data) and time-series
representation (continuous variables, e.g. price in the stock market). Here we
focus on symbolic sequences, as there are multiple approaches to convert
time-series data into symbols: for instance, Discrete Fourier Transform (DFT)
[4], Singular Value Decomposition (SVD) [5], Adaptive Piecewise Constant
Approximation [6], and Symbolic Aggregate Approximation (SAX) [7].

There are many well-known classification algorithms. However, they are
difficult to apply to sequential data, because the number of potential features
can be huge and thus intractable for relatively limited computing resources. In
a seminal paper, Lesh et al. [8] proposed FeatureMine for sequence
classification by analysing the presence of features derived from
discriminative frequent patterns. The three phases of Lesh's method are:

1. Mining features. First of all, it adapts SPADE [9] to generate frequent
patterns from sequence data. Chi-square tests are used to prune patterns to
enforce discriminative and redundancy constraints. The remaining patterns
$f_1, f_2, ..., f_n$ are output as features for classification.

2. Applying features to sequences. Most standard classifiers only accept an
example as input when it is in the form of a vector consisting of feature-value
pairs. Each feature generates a boolean value depending on its presence in a
sequence. For example, if sequence $s_i$ is "in presence of" pattern $f_1$
(i.e., $f_1$ is a subsequence of $s_i$), the value with regard to feature $f_1$
is true; otherwise it is false.

3. Classification. Based on the boolean feature-value pairs, traditional
attribute-based classifiers can be used, such as Winnow and Naive Bayes.
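
The second phase above can be illustrated with a minimal Python sketch. It is not FeatureMine itself: the function names `contains` and `to_feature_vector` are hypothetical, and the matching rule (a pattern element matches a sequence element when it is a subset of it) is an assumption made for illustration.

```python
def contains(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Both arguments are lists of frozensets. A pattern element matches a
    sequence element when it is a subset of it (assumed matching rule),
    and the order of pattern elements must be preserved.
    """
    k = 0  # index of the next pattern element still to be matched
    for element in sequence:
        if k < len(pattern) and pattern[k] <= element:
            k += 1
    return k == len(pattern)


def to_feature_vector(sequence, patterns):
    """Map one sequence to a boolean vector, one entry per pattern."""
    return [contains(sequence, p) for p in patterns]


# Sequence s1 from Table 1, written without timestamps.
s1 = [frozenset("a"), frozenset("c"), frozenset("bd"),
      frozenset("c"), frozenset("b"), frozenset("ac")]

# Illustrative patterns: {a}, {(ac)}, {a, b}, {e}.
patterns = [
    [frozenset("a")],
    [frozenset("ac")],
    [frozenset("a"), frozenset("b")],
    [frozenset("e")],
]

print(to_feature_vector(s1, patterns))  # [True, True, True, False]
```

The resulting boolean vectors can then be passed to any attribute-based classifier, as in the third phase.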

After that, [10,11,12] incorporated biological knowledge into DNA sequence
classification. Recently, many tools have been proposed for protein sequences
[13,14,15]. [16] uses an implicit motif distribution based hybrid computational
kernel for sequence classification. But to the best of our knowledge, combining
sequence classification with temporal information is nearly unexplored.


3  A  Novel  TFEM  Approach 

As discussed in the previous section, most existing sequence classification
approaches seldom explicitly take time intervals between items into
consideration. To address this limitation, we propose the temporal feature
extraction method (TFEM) to capture the interval characteristic.
Fig. 1. A timeline representation of partial sequence $s_1$ from $t_1$ to $t_4$


Definition 1. A behavior sequence is $s_i = \{e_1, e_2, ..., e_n\}$, in which
$e_n = (x_1 x_2 ... x_q)$ is an atomic item consisting of activities, events or
actions in the behavior sequence, and $x_q$ records the properties or items
associated with the sequence itemset. If $q = 1$, $e_n$ is a single atomic item;
otherwise it is a composite atomic item.

Definition 2. For an atomic item $e_n$, $t_c[e_n]$ and $t_a[e_n]$ denote the
timestamps of the current item and the previous item in a sequence $s_i$,
respectively. In particular, for the first item in $s_i$, $t_a = t_0$, which is
a reference time or start point for calculation.

Definition 3. If a pattern $p$ contains only one atomic item, $p$ is called a
1-itemset; otherwise we name the first item in $p$ as $p_{firstitem}$ and the
last item as $p_{lastitem}$. An interval for pattern $p$ in sequence $s_i$ is
defined as

$$
I_p^{s_i} =
\begin{cases}
t_c[p] - t_a[p], & p \text{ is a 1-itemset,} \\
t_c[p_{lastitem}] - t_a[p_{firstitem}], & \text{otherwise.}
\end{cases}
\qquad (1)
$$

If pattern $p$ repeats in $s_i$, an average value is taken when calculating
$I_p^{s_i}$.

An example of calculating intervals within $s_1$ from $t_1$ to $t_4$ is
depicted in Fig. 1. For instance, for the 1-itemset $\{a\}$,
$I_{\{a\}}^{s_1} = t_1 - t_0$. For the 2-itemset $\{(bd), c\}$, $p_{firstitem}$
is $\{(bd)\}$ and $p_{lastitem}$ is $\{c\}$; therefore,
$I_{\{(bd),c\}}^{s_1} = t_4 - t_2$. The 1-itemset pattern $\{c\}$ occurs twice
in $s_1$. For the first presence of $\{c\}$, the interval is $t_2 - t_1$, and
for the second presence it is $t_4 - t_3$. Hence
$I_{\{c\}}^{s_1} = ((t_4 - t_3) + (t_2 - t_1))/2$.
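
The interval calculation can be sketched in Python as follows. This is an illustrative reading of Definitions 2 and 3, not the authors' implementation: it assumes that the interval of a multi-item pattern runs from the timestamp preceding its first item to the timestamp of its last item, that only the earliest match of a multi-item pattern is used, and that a pattern item matches an element when it is a subset of it. The function name `pattern_interval` and the concrete timestamps are hypothetical.

```python
from statistics import mean


def pattern_interval(sequence, pattern, t0=0):
    """Interval of `pattern` in a temporal sequence, or None if absent.

    sequence: list of (timestamp, frozenset_of_items), sorted by timestamp.
    pattern:  list of frozensets (one frozenset per pattern item).
    t0:       reference time used as the "previous" timestamp of the first item.
    """
    times = [t for t, _ in sequence]
    prev = [t0] + times[:-1]  # timestamp preceding each element

    if len(pattern) == 1:
        # 1-itemset: average (current - previous) over all occurrences.
        hits = [times[k] - prev[k]
                for k, (_, element) in enumerate(sequence)
                if pattern[0] <= element]
        return mean(hits) if hits else None

    # Multi-item pattern: earliest greedy match (a simplifying assumption).
    matched = []
    k = 0
    for item in pattern:
        while k < len(sequence) and not item <= sequence[k][1]:
            k += 1
        if k == len(sequence):
            return None
        matched.append(k)
        k += 1
    return times[matched[-1]] - prev[matched[0]]


# Partial sequence s1 (t1..t4) with made-up timestamps t0=0, 1, 3, 4, 8.
s1 = [(1, frozenset("a")), (3, frozenset("c")),
      (4, frozenset("bd")), (8, frozenset("c"))]

print(pattern_interval(s1, [frozenset("a")]))                   # t1 - t0 = 1
print(pattern_interval(s1, [frozenset("c")]))                   # (2 + 4) / 2 = 3
print(pattern_interval(s1, [frozenset("bd"), frozenset("c")]))  # t4 - t2 = 5
```

With these timestamps, the three results mirror the three cases of the worked example: a single occurrence, an averaged repeated occurrence, and a multi-item pattern.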

The basic idea of our TFEM approach is that, during the traditional feature
extraction for sequence classification, we incorporate interval information to
create more informative features, so that the classifier can take advantage of
the newly constructed TFEM features.

3.1  Framework 

The dataflow of our TFEM sequence classification is described in Figure 2. The
whole process is divided into three phases:

- Data Representation and Preprocessing: First of all, a sequential pattern
mining algorithm is employed to get initial features (basically, they are
frequent patterns extracted from raw data that have been pruned by statistical
tests). Then we calculate an interval for each pattern in each sequence. Thus
we can generate 2-tuple (pattern, interval) pairs for every sequence.
Fig. 2. A dataflow of behavior sequence classification


-  Feature  Mining:  The  TFEM  algorithm  in  section  3.3  is  designed  to  construct  new 
temporal  features  for  sequence  classification. 

- Training and Testing: 10-fold cross-validation is conducted. A decision tree
classifier is used to generate easily interpretable rules. Then the trained
classifier makes predictions on incoming sequences.


3.2  Data  Representation  and  Preprocessing 

By using FeatureMine proposed by Lesh et al. [8], we attain patterns $\{a\}$,
$\{(ac)\}$, $\{b\}$, $\{a, b\}$, ... as our initial features in the previous
example. Then for each pattern $p$ in every sequence we calculate its interval
using formula (1) and generate a 2-tuple (pattern, interval) pair, as shown in
Table 2.


Table  2.  An  example  dataset  in  (pattern,  interval)  pairs 


ID   {a}  {b}  {a,b}  {(ac)}
s1   1    1    2      1
s2   1    1    3      -
s3   1    1    3      3
s4   1    1    2      -
s5   1    1    2      1

3.3  Feature  Mining 

Construction  of  Temporal  Features.  We  design  TFEM  temporal  feature  algorithm  to 
construct  new  temporal  features,  which  is  described  in  Fig.  3. 
Fig. 3. An example of temporal feature construction


Algorithm  1:  Temporal  Feature  Extraction  Algorithm 


Input: Min_freq ($\alpha$), Dataset D.
Output:  Candidate  temporal  features. 


(a). Represent data as 2-tuple (pattern, interval) pairs in the two-dimensional
feature space.

(b). Merge and cluster. In order to reduce the feature space, for the same
pattern p, intervals are merged if they belong to the same class. For example,
if all feature examples generated in the previous step from (pattern 1, 8) to
(pattern 1, 10) are positive, they can be merged as (pattern 1, 8~10). For
those belonging to multiple classes, we adopt an odds-ratio test and simply
prune points which are less skewed in the class distribution. For example, in
two-class classification, we calculate the discriminative power by the
following formula:

$$
\frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} \qquad (2)
$$

where $p_1$, $p_2$ are the proportions of a pattern in the different classes,
respectively. We then divide our (pattern, interval) space into several regions
by clustering. As shown in Fig. 3, there are three regions.

(c). Output region boundaries as candidate temporal features. In our example,
the three regions are our newly constructed temporal features.
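
Step (b) of Algorithm 1 can be sketched as below, under stated assumptions. The helper names `odds_ratio` and `merge_same_class` are hypothetical, the merge rule simply joins adjacent interval values with the same class label, and the proportions are assumed to lie strictly between 0 and 1. The clustering step is not shown.

```python
import math


def odds_ratio(p1, p2):
    """Odds ratio of formula (2); assumes 0 < p1 < 1 and 0 < p2 < 1."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


def merge_same_class(points):
    """Merge adjacent interval values of one pattern that share a class label.

    points: list of (interval_value, label) for a single pattern.
    Returns a list of (low, high, label) regions ordered by interval value.
    """
    regions = []
    for value, label in sorted(points):
        if regions and regions[-1][2] == label:
            low, _, _ = regions[-1]
            regions[-1] = (low, value, label)  # extend the current region
        else:
            regions.append((value, value, label))  # start a new region
    return regions


# (pattern 1, 8) .. (pattern 1, 10) are all positive, so they merge to 8~10.
points = [(9, "+"), (8, "+"), (10, "+"), (12, "-")]
print(merge_same_class(points))  # [(8, 10, '+'), (12, 12, '-')]

# A pattern seen in 60% of one class and 30% of the other.
print(math.isclose(odds_ratio(0.6, 0.3), 3.5))  # True
```

A point would then be kept only if its odds ratio exceeds a chosen threshold; the experiments in Section 4.1 report a threshold of 2.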


The next step is to make use of these regions. For an incoming sequence, we
check every pattern’s presence. If a pattern occurs, we calculate its interval
value and locate the corresponding point in the two-dimensional (pattern,
interval) feature space. The temporal feature value is true or false depending
on which region the point falls in.
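
As a rough illustration of this lookup, the sketch below turns the (pattern, interval) pairs observed in one sequence into boolean temporal features. The region representation as (pattern, low, high) triples and the name `temporal_feature_values` are assumptions made for this example only.

```python
def temporal_feature_values(observed, regions):
    """Evaluate each temporal feature (region) for one incoming sequence.

    observed: dict mapping a pattern name to its interval in the sequence;
              patterns that do not occur are simply absent from the dict.
    regions:  list of (pattern, low, high) triples, one per temporal feature.
    Returns one boolean per region: True if the pattern occurs and its
    interval falls inside [low, high].
    """
    return [
        pattern in observed and low <= observed[pattern] <= high
        for pattern, low, high in regions
    ]


# Three hypothetical regions, echoing the three regions of the example.
regions = [("pattern1", 8, 10), ("pattern1", 11, 15), ("pattern2", 1, 3)]

# An incoming sequence in which only pattern1 occurs, with interval 9.
observed = {"pattern1": 9}
print(temporal_feature_values(observed, regions))  # [True, False, False]
```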

Temporal Feature Selection. After constructing new temporal features,
statistical optimization is performed in order to achieve highly efficient
classification. There are three pruning criteria in our algorithm:

1. Features should be frequent and with strong discriminative power.

2. Features should be efficient for classification.

3. Features should be optimized, without complex parameter tuning.
This process is described in Algorithm 2.


Algorithm  2:  Temporal  Feature  Mining  Algorithm 


Input: Dataset D in the form of (pattern, interval) pairs.
Output:  Temporal  features. 


(a). Generate candidate features by the previous feature extraction algorithm.

(b). Prune any candidate that fails any of the following tests:

- Discriminative test: The odds-ratio test is employed to ensure features are
significantly discriminative among classes.

- Redundancy test: We create a new calculation formula based on Foil-Gain [17]
to estimate information gain. For instance, for two-class classification,

$$
E = \max\left( t\,w \left( \log_2 \frac{p_1}{p_1 + n_1}
    - \log_2 \frac{p_2}{p_2 + n_2} \right) \right) \qquad (3)
$$

where $p_1$, $n_1$ are the numbers of positive and negative examples covered
before adding the new feature; $p_2$, $n_2$ are the numbers of positive and
negative examples covered when adding one new feature; $t$ is the number of
positive examples covered by both; and $w$ is the proportion of the pattern’s
duration time in the global temporal dimension.

- Optimization test: We tune our model by enumerating parameter thresholds.
For instance, the threshold for a pattern’s length can be determined simply by
trial and error, that is, running our tests with different lengths and
selecting the best.

(c). Output the newly constructed features that remain after pruning in step
(b).


3.4  Training  and  Testing 

We choose a rule-based classification method for several reasons. First, it
generates human-readable results. This is very important for the
interpretability of our model in practice. Secondly, it is efficient. The time
complexity is O(N), where N is the number of rules. Finally, with respect to
imbalanced data, a rule-based learner is more effective.

Based on our temporal features, the classifier can improve its accuracy, as the
constructed features help to capture informative temporal characteristics in
the raw data.

4  Empirical  Studies 

In order to evaluate our methods, we apply TFEM to both symbolic sequence and
time-series datasets.

4.1  Health  Insurance  Dataset 

To test our TFEM framework, we use a health insurance dataset that describes
every member’s (or user’s) claim history. In our experiment, there are a total
of 15875 records from 479 users. Each record is in the format of a 4-tuple
vector (member_id, service_date, service_code, server_content). We reorganize
the data into sequences based on the attribute member_id in temporal order.
This dataset contains a sample of 479 sequences of unequal length. Each
sequence depicts a member’s claim history. Besides, each sequence in the
training set has been labeled as either “high-risk” or “low-risk”. Table 1
shows a sample of our dataset. To preserve privacy, a, b, c denote abstractions
of the actions in each real-world sequence record; $c_1$ represents the
high-risk class label while $c_2$ is the low-risk class label. Two items may
happen at the same time: for example, in Table 1, a and c can share a single
timestamp, written (ac).

Table 3. Traditional sequence classification confusion matrix (accuracy: 76.41%)

                 true high-risk  true low-risk  class precision
pred. high-risk  198             71             73.61%
pred. low-risk   42              168            80.00%
class recall     82.50%          70.29%

Table 4. TFEM sequence classification confusion matrix (accuracy: 83.11%)

                 true high-risk  true low-risk  class precision
pred. high-risk  183             24             88.41%
pred. low-risk   57              215            79.04%
class recall     76.25%          89.96%

Our algorithms are developed in Java 1.6, under the Eclipse 3.2 environment.
The hardware of our computer is a dual-core Intel Pentium 4.2 with 1.5 GB of
memory.

We conduct sequence classification on the insurance data. After the frequent
pattern mining phase, we obtain 80 features with min_support=48. The art of
choosing an appropriate min_support threshold is to make sure our feature set
is neither too big nor too small. In the discriminative test, the odds-ratio
parameter is set to 2. We use 10-fold cross validation and calculate
classification accuracy. Table 3 describes the performance of Lesh's method as
a benchmark. By comparison, Table 4 shows the performance of the TFEM model.
The comparison shows that the TFEM framework increases the accuracy from 76.41%
to 83.11%.
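
The figures in a two-class confusion matrix can be recomputed from its four counts. The short sketch below does this for the Table 3 counts, taking "high-risk" as the positive class; the function name `binary_metrics` and the dictionary keys are hypothetical. The same function can be applied to the Table 4 counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, per-class precision and per-class recall from raw counts.

    tp: predicted positive, truly positive   fp: predicted positive, truly negative
    fn: predicted negative, truly positive   tn: predicted negative, truly negative
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp),
        "precision_neg": tn / (tn + fn),
        "recall_pos": tp / (tp + fn),
        "recall_neg": tn / (tn + fp),
    }


# Counts from Table 3 (traditional sequence classification).
table3 = binary_metrics(tp=198, fp=71, fn=42, tn=168)
for name, value in table3.items():
    print(f"{name}: {value * 100:.2f}%")
```

For Table 3 this reproduces the reported accuracy of 76.41%, class precisions of 73.61% and 80.00%, and class recalls of 82.50% and 70.29%.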

4.2  Ionosphere  Dataset 

The ionosphere dataset is downloaded from the UCI KDD repository [18]. The
time-series data was collected by a system in Goose Bay, Labrador. There are
two classes in a total of 351 samples. After converting the time-series data,
we run traditional frequent pattern based sequence classification and our TFEM
approach. The result shows that TFEM outperforms its conventional counterpart,
with an increase in accuracy from 76.13% to 81.09%.

4.3  Effects  of  Varying  Odds-Ratio 

Fig. 4 shows a comparison of the traditional method and TFEM under several
odds-ratio parameter settings. We vary the odds-ratio and measure the accuracy
and time-cost.

Fig. 4. Accuracy vs. time-cost


It is observed that the greater the value of the odds-ratio parameter is, the
more candidate features are pruned, which reduces the overall time-cost. On the
other hand, higher accuracy leads to longer feature extraction time.
Flexibility is thus offered through a tradeoff between classification accuracy
and time-cost.


5  Discussion 

In this section, we first employ PCA [19] to reduce the computation cost in our
algorithms and make TFEM more efficient. Then we discuss the handling of
time-series data.

Principal component analysis (PCA) is a mathematical procedure that transforms
a number of possibly correlated variables into a smaller number of uncorrelated
variables called principal components. PCA was first invented in 1901 by Karl
Pearson [20]. PCA [21,22] is mathematically an orthogonal linear transformation
that transforms data to a new coordinate system. As can be seen from our
insurance experiment, there are 23 features. In some cases, in order to find a
finer granularity for frequent patterns, we may end up with hundreds of
features. Therefore, PCA is used to optimize our model. Fig. 5 depicts the
cumulative proportion of variance. In this way, the number of features can be
significantly reduced and only the most representative instances are kept.

Fig. 5. Principal components analysis and shift to symbolic events

Fig. 5 also shows how to convert continuous variables into the symbolic
representation. This method is based on Yi and Faloutsos and Keogh et al.'s
Piecewise Aggregate Approximation (PAA) [23]. In PAA, each record of
time-series data is divided into k segments of equal length, and the average
value of each segment is used as the data-reduced representation. The PAA model
is very straightforward and easy to implement. It is very fast and has almost
linear time complexity. On the other hand, it may lose useful information, so a
variable indicating the slope in each segment becomes useful during the
conversion process.
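
A minimal sketch of PAA, together with a per-segment slope of the kind mentioned above, is given below. It assumes that the series length is an exact multiple of k and uses a simple end-point slope; the function names `paa` and `segment_slopes` are hypothetical.

```python
def paa(series, k):
    """Piecewise Aggregate Approximation: the mean of each of k equal segments.

    This sketch assumes len(series) is a positive multiple of k.
    """
    n = len(series)
    if k <= 0 or n == 0 or n % k != 0:
        raise ValueError("len(series) must be a positive multiple of k")
    width = n // k
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def segment_slopes(series, k):
    """End-point slope of each of the k segments (0.0 for one-point segments)."""
    n = len(series)
    if k <= 0 or n == 0 or n % k != 0:
        raise ValueError("len(series) must be a positive multiple of k")
    width = n // k
    if width < 2:
        return [0.0] * k
    return [(series[i + width - 1] - series[i]) / (width - 1)
            for i in range(0, n, width)]


series = [1, 3, 2, 2, 10, 0]
print(paa(series, 3))             # [2.0, 2.0, 5.0]
print(segment_slopes(series, 3))  # [2.0, 0.0, -10.0]
```

In this example the first two segments have the same mean but different slopes, which is the kind of information that the mean alone loses.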


6  Conclusion 

Quantitative temporal information associated with activities is helpful for
distinguishing high impact behavior from others in many business
problem-solving settings. In this paper, we proposed a novel temporal feature
extraction method for behavior sequence classification. TFEM incorporates time
intervals, which are critical in many business applications, into behavior
sequence classification. With informative features, experiments show that the
performance of the classifier is significantly improved.

TFEM is of great significance for discovering knowledge from time-sensitive
behavior sequences. Furthermore, it is important to note that TFEM can be
easily extended to handle other characteristics beyond the temporal dimension,
such as the spatial dimension.
Abstract. Discovering and tracking of spatio-temporal patterns in noisy
sequences of events is a difficult task that has become increasingly pertinent
due to recent advances in ubiquitous computing, such as community-based social
networking applications. The core activities for applications of this class
include the sharing and notification of events, and the importance and
usefulness of these functionalities increase as event-sharing expands into
larger areas of one’s life. Ironically, instead of being helpful, an excessive
number of event notifications can quickly render the functionality of
event-sharing obtrusive. Indeed, any notification of events that provides
redundant information to the application/user can be seen as an unnecessary
distraction. In this paper, we introduce a new scheme for discovering and
tracking noisy spatio-temporal event patterns, with the purpose of suppressing
reoccurring patterns, while discerning novel events. Our scheme is based on
maintaining a collection of hypotheses, each one conjecturing a specific
spatio-temporal event pattern. A dedicated Learning Automaton (LA), the
Spatio-Temporal Pattern LA (STPLA), is associated with each hypothesis. By
processing events as they unfold, we attempt to infer the correctness of each
hypothesis through a real-time guided random walk. Consequently, the scheme we
present is computationally efficient, with a minimal memory footprint.
Furthermore, it is ergodic, allowing adaptation. Empirical results involving
extensive simulations demonstrate the STPLA's superior convergence and
adaptation speed, as well as an ability to operate successfully with noise,
including both the erroneous inclusion and omission of events. Additionally,
the included results, which involve a so-called “Presence Sharing” application,
are both promising and, in our opinion, impressive. It is thus our opinion that
the proposed STPLA scheme is, in general, ideal for improving the usefulness of
event notification and sharing systems, since it is capable of significantly,
robustly and adaptively suppressing redundant information.
1 Introduction

Presence  Sharing  is  a  ubiquitous  service  in  which  distributed  mobile  devices 
periodically  broadcast  their  identity  via  short-range  wireless  technology  such  as 
BlueTooth  or  WiFi  [1].  The  whole  problem  of  Presence  Sharing  is  intricately 
bound  to  the  issue  of  the  recording  and  processing  of  “events”  involving  the 
entities  included  within  the  social  network.  Applications  that  utilize  Presence 
Sharing  have  been  used  in  social  contexts  to  maintain  an  “in  touch”  feeling 
strengthening  social  relations  [2],  as  well  as  in  work  environments  to  enhance 
collaboration  between  colleagues  [3]. 

Typically, “events” occurring in the real world can be characterized as being
in one of two classes, i.e., “Stochastically Episodic” (SE) and “Stochastically
Non-Episodic” (SNE). This is a distinction that is especially pertinent in
simulation, where it is customary for one to model the behaviour of accidents,
telephone calls, network failures etc. using their respective probability
distributions, even though they follow no known pattern. Indeed, events of
these families happen all the time, and so can be termed as being
“stochastically non-episodic”. As opposed to this, there is a whole class of
events that can stochastically occur in a non-anticipated manner. These
so-called “stochastically episodic” events include earthquakes, nuclear
explosions etc. The difficulty with modelling SE events is that most of the
observations appear as noise. However, when the SE event does occur, its
magnitude and features far overshadow the background, as one observes after a
seismic event. The modelling and simulation of such SE events in the presence
of a constant stream of SNE events is a relatively new field [4,5], where the
authors model the SE and SNE events simultaneously in such a way that the
effect of an SE event is perceived through the “lens” of the underlying
background of SNE events.

Since events are almost omnipresent, one has to consider the observation due to
Garlan et al. [6], who state that the most precious resource in a computer
system is no longer its processor, memory, disk, or network, but rather human
attention. Thus, our aim in this paper is to address a fundamental challenge
concerning the above class of applications: How can one harvest the benefit of
event-sharing without distracting the application user with redundant
notifications? The solution we propose is to try to discern the nature of the
events encountered¹. Of course, the events may not be drastically SE or SNE, as
in the case of earthquakes or nuclear explosions. However, if we can discern
that an event is repeating (even though this repetition is non-periodic), it is
still of an SNE nature, which must be given less weight, while non-repeating
events (which are, in one sense, SE) must be assigned a greater weight. Thus,
the question we resolve involves demonstrating how we can enhance the Presence
Sharing experience by weighting the SE and SNE events appropriately.

¹ To exemplify the usefulness of such a strategy, consider the nuisance caused
by being notified every time one meets a colleague at work, which is a
repeating pattern, or an SNE event. In contrast, it would be far more useful to
be promptly notified whenever the same colleague unexpectedly appears in your
vicinity after travelling abroad. This would be a non-repeating pattern, or an
SE event.
1.1 Related Work

A number of earlier studies have investigated techniques for discovering the
periodicity of time patterns, such as the episode² discovery algorithm found in
[7]. However, episode discovery, and other related approaches, suffer from the
limitation that they assume unperturbed patterns that exhibit an exact
periodicity. Unfortunately, the real-life unfolding of events is typically
noise ridden. On the one hand, regular events may get cancelled, introducing
what we define as omission noise, and on the other, events may arise
spontaneously and unexpectedly, without being part of a periodic pattern,
introducing inclusion noise. A pioneering work, reported in [8], introduced the
concept of off-line mining of partially periodic events. Nevertheless, deciding
whether to suppress event notifications must often be done instantaneously, as
the events are unfolding. Indeed, we argue that any realistic scheme should
discover and adapt to patterns as they appear and evolve in an on-line manner,
without relying on extensive off-line data mining.

1.2  Paper  Contribution  and  Organization 

The paper is organized as follows. In Section 2, we present our overall approach
to on-line discovery and tracking of spatio-temporal event patterns, in which
so-called Learning Automata (LA) play a crucial role. The scheme is designed to
deal with noisy spatio-temporal event patterns, when event patterns are evolving
with time. We continue in Section 3 by evaluating our scheme using an extensive
range of static and dynamic noisy event patterns. The experiments demonstrate
the scheme’s superior convergence and adaptation speed, as well as an excellent
ability to operate successfully with noise, including both erroneous inclusion
and omission of events. In order to highlight the applicability of our scheme,
we present a “Presence Sharing” application prototype in Section 4, where we
also summarize some initial user experiences. Finally, Section 5 concludes the
paper and provides pointers for further work.

2  On-Line  Discovery  and  Tracking  of  Spatio-temporal 
Event  Patterns 

The method which we propose is based on the theory of LA. Since space does not
permit a detailed overview of this theory, this is included elsewhere [9].
However, in all brevity, we state that our scheme is based on maintaining a
collection of hypotheses, each one conjecturing a specific spatio-temporal event
pattern. A dedicated LA, which we coin the Spatio-Temporal Pattern LA (STPLA),
is associated with each hypothesis. The STPLA decides whether its corresponding
hypothesis is true by observing events as they unfold, processing evidence for
and/or against the correctness of the hypothesis. To explain this, we first
address hypothesis management, and then proceed with the details of the STPLA.

2 The expression “episode” used in this setting must not be confused with the class
of SE and SNE events described in the earlier paragraph.
2.1 Hypothesis Management

The premise of our discussions is the following: In order to reduce distraction,
events should only be signalled when they are SE. This means that they cannot
be anticipated, obey no known stochastic distribution, and possess an element of
“surprise”, i.e., they cannot be easily predicted by the recipient³. An event
can either be sporadic, arising spontaneously, or it can be part of a
spatio-temporal pattern, making it occur regularly. In either case, if it cannot
be explained by any of the spatio-temporal patterns that are known by the
recipient, the recipient should be notified. However, when the event constitutes
a part of an ongoing spatio-temporal pattern, it is really non-episodic (or SNE)
in nature. We require that this phenomenon be discovered as soon as possible, so
that the events generated from this pattern can be suppressed before the pattern
loses its novelty to the recipient.

In our proposed scheme, when an event is observed, all potentially interesting
patterns that could have produced the event are identified. We refer to these
potential patterns as hypotheses. The reader will thus observe that our approach
is based on the concept of predefined pattern structures, as advocated in [10],
rather than trying to look for patterns with unknown structure. Thus, in this
spirit, we consider a discrete world of $m$ spatial location primitives
$L = \{l_1, \ldots, l_m\}$ and of $n$ discrete time primitives
$T = \{t_1, \ldots, t_n\}$ of appropriate granularity. By way of example, the
location primitives could be “Home”, “Office”, or “Abroad”, while the time
primitives could be “Mondays”, “Tuesdays”, “Weekends”, and so on. The location
and time primitives are combined from their cross-product spaces to produce
spatio-temporal patterns. Thus, the resulting spatio-temporal pattern space
would (or could) be an exhaustive enumeration of relevant combinations such as
“Mondays at Office”, “Weekends at Office”, and so on. Each spatio-temporal
pattern of the latter form is seen as a hypothesis, conjecturing that the
respective pattern specifies an ongoing stream of events. In the following, we
assume that there are $r$ such hypotheses, represented as a set
$H = \{h_1, \ldots, h_r\}$. Observe that although the cardinality of this set
might get large, the computational efficiency and small memory footprint of our
LA (as seen presently) effectively handle the size of the state space.
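
The cross-product construction of the hypothesis set can be illustrated with a minimal sketch. The primitive names are the examples from the text; the function names `enumerate_hypotheses` and `consistent_hypotheses`, and the idea of passing the set of time primitives that an event's timestamp matches, are assumptions made for illustration only and are not taken from the paper.

```python
from itertools import product

# Hypothetical primitives, borrowed from the examples in the text.
LOCATIONS = ["Home", "Office", "Abroad"]      # L, with m = 3
TIMES = ["Mondays", "Tuesdays", "Weekends"]   # T, with n = 3


def enumerate_hypotheses(times, locations):
    """Return every (time, location) pair of the cross product T x L."""
    return list(product(times, locations))


def consistent_hypotheses(event_times, event_location, hypotheses):
    """Return the hypotheses that could have produced an observed event.

    event_times is the set of time primitives the event's timestamp matches.
    """
    return [(t, l) for (t, l) in hypotheses
            if t in event_times and l == event_location]


hypotheses = enumerate_hypotheses(TIMES, LOCATIONS)
candidates = consistent_hypotheses({"Mondays"}, "Office", hypotheses)
print(len(hypotheses), candidates)
```

An exhaustive enumeration yields r = n * m hypotheses; restricting it to "relevant combinations", as the text suggests, would simply filter this list.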

Note  too  that  the  novelty  of  this  present  work  is  not  the  above  indicated 
structuring  of  the  spatio-temporal  pattern  space,  which  is  a  well-known  approach 
used  in  typical  calendar  systems.  Rather,  it  is  the  learning  scheme  we  propose4 
for  determining  whether  a  given  spatio-temporal  event  pattern  can  be  found  in 
a  stream  of  events,  in  an  on-line  manner,  and  under  noisy  conditions. 

3 Events should, of course, also match the interest profile of the recipient. We will, in
this paper, assume that all events are of interest, as long as they are novel. On-line
adaptive learning of interest profiles will be addressed in another forthcoming paper.

4 Using the techniques presented in [4,5], we are currently investigating how one-class
classifiers can be used to learn the most appropriate hypothesis. This would assume
that the patterns which can be anticipated constitute the SNE events, and the set of
SE events, which cannot be anticipated, constitutes the one-class to be recognized.
2.2 Learning Automaton Based On-Line Discovery and Tracking of
Spatio-temporal Event Patterns

We base our work on the principles of LA [9,11]. LA have been used to model
biological systems [12], and have recently attracted considerable interest because
they can learn the optimal actions when operating in (or interacting with)
unknown stochastic environments. Furthermore, they combine rapid and accurate
convergence with low computational complexity.

Generally stated, an LA chooses a sequence of actions offered to it by a random
environment. The environment can be seen as a generic unknown medium that
responds to each action with some sort of reward or penalty, usually
stochastically. Based on the responses from the environment, the aim of the LA
is to find the action that minimizes the expected number of penalties received.
Before we proceed with describing the STPLA itself, it is necessary for us to
first define the environment that we are dealing with.

Spatio-Temporal Pattern Environment: The purpose of the Spatio-Temporal
Pattern Environment is to provide feedback to the individual STPLA about the
validity of their respective hypotheses.

In all brevity, at each time instant matching the time primitive $t_i$, if an
STPLA predicts the presence of an event at location $l_j$, it informs the
environment about this prediction. Conversely, if the STPLA predicts the absence
of an event at the same location, this too is submitted to the environment. The
environment, in turn, responds with a Reward if an event took place (or did not
take place) as predicted. If the prediction is incorrect, on the other hand, the
environment responds with a Penalty instead. That is, the STPLA is penalized if
an event takes place, but none was predicted, or if an event is predicted, but
does not take place. The latter reward policy is illustrated in Fig. 1.

Fig. 1. Feedback for a daily event hypothesis (R - Reward, P - Penalty)


The  figure  illustrates  events  generated  from  a  daily  meeting.  The  STPLA  that 
hypothesizes  a  daily  meeting  will  be  rewarded  each  day  a  meeting  takes  place 
(green  circle)  because  of  its  ability  to  correspondingly  predict  the  daily  event. 
An  important  challenge  that  we  address  in  this  paper  however,  is  how  to  deal 
with  spatio-temporal  event  patterns  that  are  affected  by  noise.  In  the  figure, 
for  example,  some  of  the  daily  meetings  may  be  cancelled  (depicted  by  white 
circles)  due  to  external  conditions,  such  as  when  the  participants  are  unavailable. 
Thus,  when  meetings  are  cancelled,  the  STPLA  maintaining  the  daily  meeting 
hypothesis  will  get  penalized  because  of  its  prediction,  despite  the  fact  that  its 
hypothesis is true. In a similar vein, so-called “straggler” events, not being part
of any periodic pattern, can also occur in a sporadic and spontaneous manner.

From the above example it can be seen that we face two kinds of noise:

Omission  Error:  This  is  an  error  which  occurs  when  an  event  that  forms  a 
part  of  a  periodic  spatio-temporal  pattern  is  randomly  left  out.  In  other 
words,  the  event  was  supposed  to  have  taken  place  according  to  the  pattern, 
but did not. Notice the SE nature of this event: it is not something that
could have been anticipated.

Inclusion  Error:  This  is  an  error  which  occurs  when  an  event  that  occurs  is 
not  part  of  a  periodic  (anticipated)  pattern,  but  rather  arises  sporadically 
and  spontaneously.  Again,  one  must  observe  the  SE  nature  of  this  event. 

By  way  of  example,  Alice  may  cancel  a  regular  meeting  with  Bob  due  to  ill  health. 
However,  Alice  may  still  meet  Bob  sometime  outside  of  the  regular  meeting 
schedule  purely  by  chance  (e.g.,  an  accidental  meeting  in  the  canteen).  In  this 
manner,  we  can  appropriately  model  both  these  kinds  of  noise. 
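
The two kinds of noise, together with the Reward/Penalty policy of the environment, can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the probabilities `q` (omission) and `p` (inclusion) follow the notation that Section 3 introduces for the experiments.

```python
import random


def noisy_event(pattern_active, q, p, rng):
    """Return True if an event occurs at an instant matching the hypothesis.

    pattern_active: whether the hypothesized pattern really exists.
    q: omission probability (a pattern event is randomly left out).
    p: inclusion probability (a sporadic event appears without a pattern).
    """
    if pattern_active:
        return rng.random() >= q
    return rng.random() < p


def feedback(predicted_event, event_occurred):
    """Reward a correct prediction and penalize an incorrect one."""
    return "Reward" if predicted_event == event_occurred else "Penalty"


rng = random.Random(0)
# A daily-meeting hypothesis that always predicts a meeting, with 20% omissions.
stream = [feedback(True, noisy_event(True, 0.2, 0.0, rng)) for _ in range(10)]
print(stream)
```

Under this model, a true hypothesis still collects occasional Penalties (cancelled meetings), which is exactly the difficulty the STPLA is designed to tolerate.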

The Spatio-Temporal Pattern Learning Automaton (STPLA): We now introduce the
STPLA that we have designed to discover and track spatio-temporal patterns. In
brief, the task of an STPLA is to decide whether a specific spatio-temporal
pattern hypothesis is true. The correctness of a hypothesis is decided by
observing events as they unfold.

The STPLA can be designed to model arbitrarily general SE and SNE events.
But due to space limitations, in this paper, we confine our design and
implementation details to events which can be characterized deterministically.

The STPLA is inspired by the so-called family of fixed structure LA [13].
Accordingly, an STPLA can be defined in terms of a quintuple [9]:

$$\{\Phi, \alpha, \beta, \mathcal{F}(\cdot,\cdot), \mathcal{G}(\cdot)\}.$$

Here, $\Phi = \{\phi_1, \phi_2, \ldots, \phi_s\}$ is the set of internal automaton
states, and $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$ is the set of
automaton actions. Further, $\beta = \{\beta_1, \beta_2, \ldots, \beta_m\}$ is the
set of inputs that can be given to the automaton. An output function
$\alpha_t = \mathcal{G}[\phi_t]$ determines the action performed (or chosen) by the
automaton given the current automaton state. Finally, a transition function
$\phi_{t+1} = \mathcal{F}[\phi_t, \beta_t]$ determines the new state of the
automaton from: (1) the current state of the automaton and (2) the response of
the environment to the action performed (or chosen) by it.

Based  on  the  above  generic  framework,  the  crucial  issue  is  to  design  automata 
that  can  learn  the  optimal  action  when  interacting  with  the  environment.  Several 
designs  have  been  proposed  in  the  literature,  and  the  reader  is  referred  to  [9] 
for  an  extensive  treatment.  In  this  paper,  since  we  target  the  learning  of  spatio- 
temporal  patterns,  our  goal  is  to  design  an  LA  that  is  able  to  discover  and  track 
such  patterns  over  time.  Briefly  stated,  we  construct  an  automaton  with 

- States: $\Phi = \{1, 2, \ldots, N_1, N_1 + 1, \ldots, N_1 + N_2 + 1\}$.

- Actions: $\alpha = \{\text{Notify}, \text{Suppress}\}$.

- Inputs: $\beta = \{\text{Reward}, \text{Penalty}\}$.
Fig. 2. The state transition map and the output function of a STPLA


Fig. 2 specifies the state space of the STPLA as well as the $\mathcal{G}$ and
$\mathcal{F}$ matrices. The $\mathcal{G}$ matrix can be summarized as follows. If
the automaton state lies in the set $\{1, \ldots, N_1\}$, which we refer to as the
Pattern Evaluation States, then the LA will choose the action “Notify”. If, on
the other hand, the state is either $N_1 + 1$ or one of the states in the set
$\{N_1 + 2, \ldots, N_1 + N_2 + 1\}$, it will choose the action “Suppress”. We
refer to the state $N_1 + 1$ as the Pattern Acceptance State, and the states
$\{N_1 + 2, \ldots, N_1 + N_2 + 1\}$ as the Pattern Tracking States for reasons
explained presently. Note that since we initially do not know whether a pattern
is present, we set the initial state of our automaton to 1.

The state transition matrix $\mathcal{F}$ determines how the learning proceeds.
In brief, the learning is divided into three parts:

Pattern  Evaluation:  In  the  Pattern  Evaluation  part,  the  goal  of  the  LA  is  to 
discover  the  presence  of  the  spatio-temporal  event  pattern  associated  with 
the  maintained  hypothesis,  without  being  distracted  by  omission  and  inclu¬ 
sion  errors.  In  this  phase,  the  state  transitions  illustrated  in  the  figure  are 
such  that  any  deviance  from  the  hypothesized  pattern,  modelled  as  a  Penalty 
(P),  causes  a  jump  back  to  state  1.  Conversely,  only  a  systematic  presence 
of  the  pattern  hypothesized,  modelled  as  a  pure  sequence  of  Rewards  (R), 
will allow the LA to pass into the Pattern Acceptance part.

Pattern Acceptance: In the Pattern Acceptance part, consisting of state
$N_1 + 1$, the hypothesized pattern has been confirmed with high probability.

Pattern Tracking: In the Pattern Tracking part, consisting of states
$\{N_1 + 2, \ldots, N_1 + N_2 + 1\}$, the goal is to detect when the discovered
pattern disappears, without getting distracted by omission errors. Thus, this
part is the “opposite” of the Pattern Evaluation part in the sense that a pure
sequence of Penalties is required to “throw” the LA back into the Pattern
Evaluation part again, while a single Reward reconfirms the pattern, returning
the LA to the Pattern Acceptance part of the state space.
In other words, the automaton attempts to incorporate past deterministic
responses when deciding on a sequence of actions.
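
The three-part transition scheme can be written down compactly. The following is a minimal sketch based only on the textual description above (Fig. 2 itself is not reproduced here), so the class and method names are hypothetical and the boundary behaviour is an interpretation: a Penalty in the last Pattern Tracking state is assumed to return the automaton to state 1.

```python
class STPLA:
    """Sketch of a Spatio-Temporal Pattern Learning Automaton."""

    def __init__(self, n1, n2):
        self.n1 = n1      # number of Pattern Evaluation states
        self.n2 = n2      # number of Pattern Tracking states
        self.state = 1    # initial state: no pattern known yet

    def action(self):
        """Output function: Notify in evaluation states, Suppress otherwise."""
        return "Notify" if self.state <= self.n1 else "Suppress"

    def update(self, feedback):
        """Transition function for an input of 'Reward' or 'Penalty'."""
        last = self.n1 + self.n2 + 1
        if self.state <= self.n1:
            # Pattern Evaluation: any Penalty resets, Rewards advance.
            self.state = self.state + 1 if feedback == "Reward" else 1
        elif feedback == "Reward":
            # Acceptance or Tracking: a single Reward reconfirms the pattern.
            self.state = self.n1 + 1
        elif self.state < last:
            # Acceptance or Tracking: each Penalty moves one state deeper.
            self.state += 1
        else:
            # A pure run of Penalties throws the LA back to evaluation.
            self.state = 1


la = STPLA(n1=3, n2=2)
for fb in ["Reward", "Reward", "Reward"]:
    la.update(fb)
print(la.state, la.action())
```

With this reading, N_1 consecutive Rewards are needed to start suppressing, and N_2 + 1 consecutive Penalties are needed to resume notifying.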

We define the “Ensemble” characteristic of a set of STPLA as follows: An
event is only signalled to the recipient when all of the STPLA that maintain
hypotheses that are consistent with the event collectively find themselves in the
Pattern Evaluation part of the state space. As soon as one of the STPLA can
deterministically⁵ explain an event as being part of the corresponding
hypothesized spatio-temporal event pattern, that particular event will be
suppressed and no notification will be issued to the recipient.
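
The ensemble rule amounts to a logical conjunction over the automata whose hypotheses are consistent with the event. A minimal sketch, assuming each automaton's current action is available as the string "Notify" or "Suppress" (the function name and the dictionary layout are hypothetical):

```python
def should_notify(event_hypotheses, actions):
    """Signal an event only if no consistent STPLA suppresses it.

    event_hypotheses: hypotheses consistent with the observed event.
    actions: mapping from hypothesis to the current action of its STPLA.
    """
    return all(actions[h] == "Notify" for h in event_hypotheses)


actions = {
    ("Mondays", "Office"): "Suppress",   # accepted weekly pattern
    ("Weekends", "Office"): "Notify",    # still under evaluation
}
print(should_notify([("Mondays", "Office")], actions))   # suppressed
print(should_notify([("Weekends", "Office")], actions))  # notified
```

Note that in this sketch an event that matches no hypothesis at all is notified, which agrees with the premise that unexplained events should reach the recipient.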

3  Experiments 

In  order  to  evaluate  our  scheme,  we  have  applied  it  to  both  an  event  simulation 
system  as  well  as  to  a  real  world  prototype.  This  section  reports  the  results 
obtained  using  the  simulation,  while  the  next  section  covers  the  prototype. 

Since one of our main aims is handling noisy patterns, we intend to impose
“stress” onto our scheme by using a wide range (percentage or degrees) of
omission and inclusion errors. We will use q to denote the probability of event
omission, while p denotes the probability of event inclusion. We also investigate
how the number of states $N_1$ and $N_2$ affect the LA's speed and accuracy.

As a performance criterion, we have chosen the probability of issuing a
notification (alert) when an event takes place. We refer to this probability as
$P_1$. Intriguingly, when a spatio-temporal pattern produces events, $P_1$ should
be minimized, while when events are novel, $P_1$ should be maximized. We will
presently see that our scheme achieves both. For instance, consider an event
that occurs daily, with the possibility, however, that events may get cancelled
(causing omission errors). In that case, our scheme should quickly stop alerting
the user about these events. In contrast, when novel sporadic events occur, even
on a daily basis, our scheme should always produce alerts, so that the user is
notified about these novel events. Thus, by monitoring our scheme in terms of
the index $P_1$ using various scenarios, we can capture its overall performance.


3.1  Performance  after  Convergence 

Table 1 summarizes the performance after convergence, with a wide range of
event inclusion probabilities, p, event omission probabilities, q, Pattern
Evaluation States, $N_1$, and Pattern Tracking States, $N_2$. The resulting
performance is then reported in terms of $P_1$, with $P_1$ being estimated by
averaging over 1,000 experiments, each consisting of 100,000 iterations.

In  the  case  of  daily  patterns,  we  have  varied  the  omission  error  probabilities 
from  q  =  0.05  to  q  =  0.2,  thus  covering  a  spectrum  of  small  to  high  degrees 
of  omission  noise.  In  the  case  when  no  patterns  are  present,  we  have  allowed 
random  encounters  to  appear  with  probabilities  from  p  =  0.05  to  p  =  0.2. 
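
A small Monte Carlo sketch shows how such an alert probability could be estimated. It is not the authors' simulator: the transition rules are re-stated from the description in Section 2.2, the function names are hypothetical, and a single run of 200,000 iterations is used instead of the averaging over 1,000 experiments reported in the text. A Reward is assumed to correspond to an event occurring at an instant matching the hypothesis.

```python
import random


def step(state, rewarded, n1, n2):
    """One STPLA transition, following the three-part scheme of Section 2.2."""
    if state <= n1:
        return state + 1 if rewarded else 1
    if rewarded:
        return n1 + 1
    return state + 1 if state < n1 + n2 + 1 else 1


def alert_probability(event_prob, n1, n2, iterations, seed=0):
    """Estimate the fraction of events that trigger a notification."""
    rng = random.Random(seed)
    state, events, alerts = 1, 0, 0
    for _ in range(iterations):
        occurred = rng.random() < event_prob
        if occurred:
            events += 1
            if state <= n1:          # action chosen before the event is seen
                alerts += 1
        state = step(state, occurred, n1, n2)
    return alerts / events


# Daily pattern with omission probability q = 0.2: events occur with 1 - q.
daily = alert_probability(event_prob=1 - 0.2, n1=5, n2=5, iterations=200_000)
# No underlying pattern, inclusion probability p = 0.2.
novel = alert_probability(event_prob=0.2, n1=5, n2=5, iterations=200_000)
print(f"daily pattern, q = 0.2: {daily:.2e}")
print(f"no pattern,    p = 0.2: {novel:.3f}")
```

Under these assumptions, the first estimate should be very small and the second close to one, mirroring the qualitative behaviour the section describes.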

5 The system can easily be generalized for SE and SNE events by rendering the
transitions stochastic.


LA  Based  On-Line  Discovery  and  Tracking  of  Spatio-temporal  Patterns 




Table 1. Alert probability $P_1$ under varying conditions

                            Daily Pattern               No Underlying Pattern
($N_1$, $N_2$)   q = 0.05   q = 0.1    q = 0.2    p = 0.05   p = 0.1   p = 0.2
(1,5)            1.5E-8     9.9E-7     6.4E-5     0.735      0.531     0.262
(2,5)            3.2E-8     2.1E-6     1.4E-4     0.983      0.925     0.680
(3,5)            TOE-8      3.3E-6     2.4E-4     0.999      0.992     0.916
(4,5)            6.7E-8     1.71 vG    3.6E-4     0.999      0.999     0.982
(5,5)            8.6E-8     6.2E-6     5.2E-4     0.999      0.999     0.996
(5,4)            1.7E-6     6.2E-5     2.6E-3     0.999      0.999     0.997
(5,3)            3.4E-5     6.2E-4     0.012      0.999      0.999     0.998
(5,2)            6.0E-4     6.2E-3     0.062      0.999      0.999     0.998
(5,1)            0.0137     0.050      0.254      0.999      0.999     0.999

From Table 1, we see that for the best configuration, $N_1 = N_2 = 5$, we get
very high accuracy, with the scheme producing a negligible number of superfluous
notifications to the user, while alerting the user of almost all novel events, even
with high degrees of both omission and inclusion errors.


3.2  Performance  in  Dynamic  Environment 

To investigate the ability of our scheme to track spatio-temporal patterns that
change with time, we have conducted several experiments in dynamic environments.
In all brevity, we report here a representative configuration, where
spatio-temporal patterns end after a certain time period, while new ones are
introduced every 200th iteration. We modelled this by using an omission error
probability of q = 0.2 when a pattern was present, and an inclusion error
probability of p = 0.2 when no pattern was present.

Fig. 3. Evolution of the alert probability in a dynamic environment

Fig. 3 depicts how the STPLA scheme adapts to the presence and absence
of patterns over time. For instance, prior to time instant 200, the probability q
was equal to 0.2, implying the presence of a daily pattern. As seen, the STPLA
quickly learns to suppress these events, albeit with some error due to the high
omission error probability. When the pattern disappears after 200 time steps,
being replaced with novel events only, we observe how quickly the STPLA changes
from suppressing the events to alerting the user of them.

We  thus  conclude  by  stating  that  the  empirical  results  confirm  the  power  of 
STPLA  both  in  noisy  and  dynamic  environments. 

4  Prototype 

In  addition  to  the  empirical  results  presented  in  the  previous  section,  we  have 
also  implemented  a  social  networking  application  and  conducted  real-life  tests. 

A key requirement of our community-based social networking application is
that users can be made aware of the Presence of their friends anytime and
anywhere using their mobiles' sensing capabilities. The latter requirement is
akin to the field of pervasive computing, where ad-hoc mode-based architectures
are recognized to be a better alternative than infrastructure-based architectures.

We now provide a brief description of our prototype, the details of whose
implementation can be found in [14]. Our prototype system consists of two mobile
phones, an HTC P3300 and a Sony Ericsson AT, both of which are equipped with
Wi-Fi modules. An ad-hoc network is established to provide a communication
platform where our proposed solution for a “Friend Reminder” service runs.

This design is based on the “SmokeScreen” architecture [1], which introduces
an effective approach to resolve privacy issues of Presence Sharing. The signal
generation procedure⁶ referred to in [1] is depicted in Fig. 4. However, we
have added novel enhancements to the “SmokeScreen” approach, by introducing
mechanisms that allow a finer level of privacy control. In brief, we allow the
user to specify exactly which of his friends can see the signal of his Presence.
Accordingly, we let every pair of friends share a symmetric key. This is in
contrast to the results presented in [1], where a user shares the information of
his Presence with his social network at the granularity of his group. A major
disadvantage of the latter approach is thus that the user cannot apply a finer
privacy control by preventing a specific member of the group from sensing the
information of his Presence (unless the user does not broadcast the signal of
his Presence at all). From a privacy perspective, we believe that user-related
information should be fully under the user's own control. Thus, every user
should be able to authorize the specific people who have the right to reveal his
user-related information, and to also isolate other users.

6 As in [1], we use MD5 to compute the signal and SHA-1 to update the secret key.

Fig. 4. The signal and key generation over time proposed by [1] and used by us, where
$k_{a-b}$ stands for the symmetric key

The users must be synchronized to independently update the Presence signal
and broadcast it periodically. Note that the update is deterministic so that every
pair of participating users (for example Alice and Bob) can predict and interpret
the time-varying broadcast Presence signal. The Presence signal might vary on
the hour and is known only to Alice and Bob, thus preventing impersonation
attacks. As alluded to previously, we employed a symmetric key per pair of
social contacts. Consequently, the size of the broadcast Presence signal increases
linearly with the number of social contacts. In order to alleviate this problem, we
have used Bloom filters to reduce the size of the Presence signal [15], and thus
the operation of Presence detection reduces to the Bloom filter match operation.
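
The combination of pairwise keys and a Bloom filter can be sketched as follows. The text only states that MD5 computes the signal and SHA-1 updates the key; everything else here is an assumption for illustration: the exact inputs to the hashes, the time-slot encoding, the filter size, and the use of salted SHA-256 for the filter's hash functions are hypothetical and do not describe the prototype's actual protocol.

```python
import hashlib


def presence_signal(key: bytes, slot: int) -> bytes:
    """Hypothetical signal: MD5 over a pairwise key and a time-slot counter."""
    return hashlib.md5(key + slot.to_bytes(8, "big")).digest()


def next_key(key: bytes) -> bytes:
    """Hypothetical deterministic key update using SHA-1."""
    return hashlib.sha1(key).digest()


class BloomFilter:
    """Small Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Alice broadcasts one fixed-size filter covering all of her pairwise signals.
slot = 42
alice_filter = BloomFilter()
for key in (b"key shared with Bob", b"key shared with Carol"):
    alice_filter.add(presence_signal(key, slot))

# Bob recomputes the signal he expects from Alice and tests it for membership.
bob_sees_alice = presence_signal(b"key shared with Bob", slot) in alice_filter
print(bob_sees_alice)
```

A Bloom filter never yields false negatives, so a legitimate friend always detects the Presence; the price is a small, tunable false-positive rate for parties without a matching key.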

Based on the above architecture, we implemented our STPLA scheme on each
mobile phone, allowing suppression of Presence notification when the Presence
is part of a regular pattern. In all brevity, the STPLA scheme made the “Friend
Notification Service” less obtrusive by alerting the user only of novel events,
while suppressing alerts for regular meetings (e.g., for weekly fixtures).

5  Conclusion 

In this paper, we have presented the Spatio-Temporal Pattern Learning
Automaton (STPLA) for the on-line discovery and tracking of patterns in noisy
event streams. Our scheme is based on a team of finite automata, rendering
it computationally efficient with a minimal memory footprint. The advantages
of our approach were demonstrated through extensive simulations, as well as a
prototype running on mobile devices. The scheme demonstrated excellent
performance under different noise levels and in various dynamic settings. We thus
believe the STPLA forms an ideal framework for notification suppression in event
notification based systems. As future work, we intend to formally analyze the
behaviour of the STPLA, as well as to extend our prototype to learning interest
profiles and adaptive service recommendations.
Abstract. We propose a graph model for clustering based on mutual
information and show that the clustering problem can be approximated
as a combinatorial problem over the proposed graph model. Based on the
stationary distribution induced from the problem setting, we propose a
function which measures the relevance among data objects. This function
makes it possible to represent the entire set of objects as an edge-weighted
graph, where pairs of objects are connected by edges weighted with their
relevance. We show that, in hard assignment, the clustering problem can be
approximated as a combinatorial problem over the proposed graph model when
data is uniformly distributed. We demonstrate the effectiveness of the
proposed approach on the document clustering problem. The results
are encouraging and indicate the effectiveness of our approach.


1  Introduction 

Clustering is a process of finding a partition of data objects into mutually
exclusive and exhaustive groups. The groups are called clusters. The objective is
to find clusters of data objects such that data within the same group are similar
to each other, while data among different groups are dissimilar. Clustering is a
fundamental data processing task in various fields, and has been investigated in many
research  communities,  e.g.,  machine  learning,  data  mining,  etc.  [()}. 

In  this  paper  we  consider  data  clustering  under  the  framework  in  [13],  where 
the  clustering  problem  is  formalized  as  a  constrained  optimization  problem  based 
on mutual information. Since this problem is difficult to solve due to the
non-linearity of mutual information and non-convexity of the objective function,
several  approximation  algorithms  have  been  proposed  [13,10,9]. 

Based  on  the  stationary  distribution  induced  from  the  problem  setting,  we 
propose  a  function  which  measures  the  relevance  among  data  objects  under  the 
problem setting. Since this function captures the pairwise relation among data
objects, the entire set of data objects can be represented as an edge-weighted
graph, where data objects (which correspond to vertices) are connected by edges
weighted with their relevance. The edge-weighted graph for the entire set of
data objects is called a data graph in this paper.

We show that, in hard assignment, clustering based on mutual information can
be approximated as a combinatorial problem over the proposed data graph when
data is uniformly distributed. Representing the entire set of data objects as a
data graph and formalizing the clustering problem over the graph enables us to
utilize various graph algorithms to solve clustering with mutual information. We
demonstrate the effectiveness of the proposed approach by utilizing spectral
clustering over the proposed graph model and evaluating it on the document
clustering problem. The results are encouraging and show the validity of the
proposed approach.

Our contributions are: 1) proposal of a graph model for clustering based on
mutual information, 2) clarification of the correspondence between the original
clustering problem and the combinatorial problem over the graph model, and 3)
validation of the proposed approach on the document clustering problem.

Section  2  explains  the  problem  setting.  Section  3  explains  the  details  of  our 
approach.  Section  4  reports  the  results  of  experiments  and  comparison  with  other 
approaches.  Section  5  gives  concluding  remarks. 

2  Problem  Settings 

2.1  Preliminaries 

Let $\mathcal{X}$ be a set of data objects. For a set $\mathcal{X}$,
$|\mathcal{X}|$ represents its cardinality.

Suppose $X$ stands for a random variable over the domain $\mathcal{X}$, and
$p_1(x)$ and $p_2(x)$ are probability distributions for $X$.

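Definition 1. Kullback-Leibler (KL) divergence between two probability
distributions $p_1(x)$ and $p_2(x)$ for a random variable $X$ is defined as [1]:

$$D_{KL}[p_1(x) \,\|\, p_2(x)] = \sum_{x \in \mathcal{X}} p_1(x) \log \frac{p_1(x)}{p_2(x)} \qquad (1)$$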


Suppose $X$ and $Y$ are two random variables (their domains are $\mathcal{X}$
and $\mathcal{Y}$), and $p(x,y)$ stands for their joint probability
distribution. Let $p(x)$ and $p(y)$ stand for their marginal probability
distributions, and let $p(y|x)$ stand for the conditional probability
distribution of $Y$ given the observation of $X$.

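Definition 2. Mutual Information $I(X;Y)$ between two random variables $X$
and $Y$ is defined as:

$$\begin{aligned}
I(X;Y) &= \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} && (2) \\
&= D_{KL}[p(x,y) \,\|\, p(x)p(y)] && (3)
\end{aligned}$$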
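
As a minimal sketch of Definitions 1 and 2, the following Python snippet computes the KL divergence and obtains the mutual information as the KL divergence between a joint distribution and the product of its marginals, as in eq.(3). The function names and the 2x2 joint tables are hypothetical illustrations and are not taken from the paper; natural logarithms are assumed.

```python
import math


def kl_divergence(p1, p2):
    """D_KL[p1 || p2] for two distributions given as equal-length lists."""
    total = 0.0
    for a, b in zip(p1, p2):
        if a > 0.0:  # terms with p1(x) = 0 contribute nothing
            total += a * math.log(a / b)  # assumes b > 0 wherever a > 0
    return total


def mutual_information(joint):
    """I(X;Y) computed as D_KL[p(x,y) || p(x)p(y)] from a joint table."""
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    flat_joint = [v for row in joint for v in row]
    flat_prod = [a * b for a in px for b in py]  # same row-major order
    return kl_divergence(flat_joint, flat_prod)


# Hypothetical joint distributions p(x, y); rows index x, columns index y.
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
```

For the independent table the joint equals the product of the marginals, so the mutual information is zero, while the dependent table yields a positive value.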


2.2  The  Information  Bottleneck  Framework 

Data clustering based on mutual information was proposed in [13]. The objective
is to find clusters $\mathcal{T}$ of data objects $\mathcal{X}$ such that the
clusters are still informative about the specified relevant variable $Y$.
Random variables $X$ and $T$ correspond to $\mathcal{X}$ and $\mathcal{T}$, and
$T$ should be completely defined given $X$, independently of $Y$.

For instance, suppose a set of documents $\mathcal{X} = \{x_1, \ldots, x_n\}$
is specified, each of which contains a "bag" of terms to describe the document.
Here, the set of all terms utilized to describe the documents corresponds to
$\mathcal{Y} = \{y_1, \ldots, y_m\}$. $p(x,y)$ represents the joint probability
of a document $x$ containing a term $y$, and can be estimated by the
co-occurrence of $x$ and $y$. The goal of data clustering is to find a
partition $\mathcal{T} = \{t_1, \ldots, t_k\}$ of $\mathcal{X}$ such that $T$
is still informative about $Y$. Here, each $t \in \mathcal{T}$ corresponds to a
cluster of documents.

Data  clustering  is  formalized  as  a  constrained  optimization  problem  [13]. 

Problem 1. Find the conditional probability distribution $p(t|x)$ which
minimizes the following objective function

$$C = I(X;T) - \beta I(T;Y) \tag{4}$$

where $I(X;T)$ and $I(T;Y)$ are the mutual information between $X$ and $T$ and
between $T$ and $Y$, respectively, and $\beta$ is a control parameter.

Intuitively, $I(X;T)$ corresponds to the compactness of the new representation
$T$ for representing the value of $X$, while $I(T;Y)$ corresponds to the
accuracy of $T$ for predicting the value of $Y$. It was shown that the optimal
solution of Problem 1 should satisfy the following self-consistent equations
[13,8].

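Theorem 1. When $p(x,y)$ and $\beta$ are specified, and the Markovian relation
$T \leftrightarrow X \leftrightarrow Y$ holds, $p(t|x)$ is a stationary point
of $C$ if and only if $p(t|x)$ satisfies the following equations:

$$p(t|x) = \frac{p(t)}{Z(x,\beta)} \exp(-\beta D_{KL}[p(y|x) \,\|\, p(y|t)]) \tag{5}$$

$$Z(x,\beta) = \sum_{t} p(t) \exp(-\beta D_{KL}[p(y|x) \,\|\, p(y|t)]) \tag{6}$$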
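
The right-hand side of eq.(5) can be evaluated directly once $p(t)$ and $p(y|t)$ are given. The sketch below does this for a single data object and normalizes by $Z(x,\beta)$ of eq.(6). It is only a single evaluation under assumed inputs, not an iterative solver for Problem 1; the function names and the toy distributions are hypothetical.

```python
import math


def kl(p, q):
    """D_KL[p || q]; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def ib_assignment(p_t, p_y_given_x, p_y_given_t, beta):
    """Evaluate the right-hand side of eq.(5) for one data object x.

    Returns the distribution p(t|x) and the normaliser Z(x, beta) of eq.(6).
    """
    unnorm = [pt * math.exp(-beta * kl(p_y_given_x, pyt))
              for pt, pyt in zip(p_t, p_y_given_t)]
    z = sum(unnorm)  # Z(x, beta)
    return [u / z for u in unnorm], z


# Hypothetical toy values: two clusters and three terms.
p_t = [0.5, 0.5]
p_y_given_t = [[0.7, 0.2, 0.1],
               [0.1, 0.2, 0.7]]
p_y_given_x = [0.6, 0.3, 0.1]

soft, _ = ib_assignment(p_t, p_y_given_x, p_y_given_t, beta=1.0)
sharp, _ = ib_assignment(p_t, p_y_given_x, p_y_given_t, beta=50.0)
```

With these toy values the assignment concentrates on the first cluster as `beta` grows, which illustrates how a large $\beta$ pushes $p(t|x)$ toward a hard assignment.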

2.3  Approximation  Algorithms 

The closed form formula in eq.(5) indicates that $p(t|x)$ is the stationary
distribution under the problem setting. However, $p(t|x)$, the left hand side
of eq.(5), implicitly (and non-linearly) affects its right hand side under this
framework. Furthermore, the objective function $C$ in eq.(4) is not convex with
respect to $p(t|x)$, $p(t)$, $p(y|t)$ simultaneously. Thus, it is quite
difficult to find the global optimal solution of Problem 1.

Several algorithms were proposed to find approximate solutions of eq.(4)
[13,10,9,8]. It is reported that an algorithm called sIB outperformed other
algorithms in terms of the quality of clusters. This algorithm returns a hard
assignment, i.e., each data object is assigned to only one cluster.

3  A  Graph-Based  Approach 

3.1  Preliminaries 

A graph $G(V,E)$ consists of a finite set of vertices $V$ and a set of edges
$E$ over $V \times V$. The set $E$ can be interpreted as representing a binary
relation on $V$. An edge-weighted graph $G(V,E,\boldsymbol{W})$ is defined as a
graph $G(V,E)$ with the weight on each edge in $E$. When $|V| = n$, the weights
in $\boldsymbol{W}$ can be represented as an




$n$ by $n$ matrix $\mathbf{W}$¹, where $w_{ij}$ in $\mathbf{W}$ stands for the
weight on the edge for the pair $(v_i, v_j) \in E$. We set $w_{ij} = 0$ if the
pair $(v_i, v_j)$ is not in $E$.

3.2  A  Pseudo-similarity  Function 

Based on Theorem 1 and eq.(5), we regard $D_{KL}[p(y|x) \,\|\, p(y|t)]$ as
representing the pseudo-dissimilarity between $x$ (data object) and $t$
(cluster) under the framework in Section 2. Furthermore, we extend this insight
from $\mathcal{X} \times \mathcal{T}$ to $\mathcal{X} \times \mathcal{X}$, and
propose to utilize KL divergence as a pseudo-dissimilarity function between
data objects for the clustering problem.

Based on the above argument, we propose the following function, which
corresponds to a pseudo-similarity function under the framework in Section 2.

Definition 3. A function $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{+}$
is defined as

$$s(x_i, x_j) = p(x_j) \exp(-\beta D_{KL}[p(y|x_i) \,\|\, p(y|x_j)]) \tag{7}$$

where $\beta$ is the control parameter in Problem 1.
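
A minimal sketch of eq.(7) is given below, assuming that $p(x)$ and the conditional term distributions $p(y|x)$ are already available as plain lists. The function name `pseudo_similarity` and the three toy documents are hypothetical; the conditional distributions are assumed to be strictly positive so that the KL divergence is finite.

```python
import math


def kl(p, q):
    """D_KL[p || q]; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def pseudo_similarity(p_x, p_y_given_x, beta):
    """Matrix of s(x_i, x_j) = p(x_j) * exp(-beta * D_KL[p(y|x_i) || p(y|x_j)])."""
    n = len(p_x)
    return [[p_x[j] * math.exp(-beta * kl(p_y_given_x[i], p_y_given_x[j]))
             for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: three documents over three terms, uniform p(x).
p_x = [1 / 3, 1 / 3, 1 / 3]
p_y_given_x = [[0.6, 0.3, 0.1],
               [0.5, 0.4, 0.1],
               [0.1, 0.2, 0.7]]

s = pseudo_similarity(p_x, p_y_given_x, beta=1.0)
```

Because the KL divergence is not symmetric, `s[i][j]` and `s[j][i]` generally differ, which is why the function is called a pseudo-similarity.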

3.3  A  Data  Graph 

The function defined in eq.(7) represents the pairwise relation among data
objects. Since a pairwise relation can be represented as a graph, we propose to
represent this relation as an edge-weighted graph, using $s(x_i, x_j)$ in
eq.(7) as the weight for the edge $(x_i, x_j)$.

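Definition 4. For a set of objects $\mathcal{X}$, by mapping each data object
to a vertex, an edge-weighted graph $G(V, E, \boldsymbol{W})$ is defined as:

$$V = \mathcal{X} \tag{8}$$

$$w_{ij} = \begin{cases} s(x_i, x_j) & x_i \neq x_j \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

$$E = \{(x_i, x_j) \mid s(x_i, x_j) > 0\} \tag{10}$$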

From eq.(8), we abuse the symbol $\mathcal{X}$ to denote the set of vertices in
the data graph. Note that the weights are non-negative. We call this graph the
data graph in this paper. We assume that the data graph $G$ for a given
$\mathcal{X}$ is connected². We define the conditional probability over the
data graph³ as

$$p(x_j|x_i) = \frac{w_{ij}}{\sum_{j} w_{ij}} \tag{11}$$

Proposition 2. The conditional probability in eq.(11) is a stationary
distribution in Theorem 1 where $\mathcal{T} = \mathcal{X}$.

Proof. By treating each $x_j$ as $t$, it is easy to confirm from eqs.(7) and
(9). □


1 A bold italic symbol $\boldsymbol{W}$ denotes a set, while a bold symbol $\mathbf{W}$ denotes a matrix.

2 Each vertex has at least one edge with positive weight. For disconnected graphs,
w.l.o.g., each component can be dealt with separately.

3 It is easy to verify that eq.(11) is a valid conditional probability.

Setting $\mathcal{T} = \mathcal{X}$ corresponds to one extremal situation where
no compression of $\mathcal{X}$ is conducted. In that situation, Proposition 2
says that $p(x_j|x_i)$ in eq.(11) over the data graph, based on the function in
eq.(7), satisfies the necessary condition for the optimal solution of
Problem 1.
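
The construction of the data graph and of the conditional probability in eq.(11) can be sketched as follows. The code assumes that a matrix of pseudo-similarity values is already given and that pairs of identical objects receive weight 0, which is our reading of eq.(9) and should be treated as an assumption. The function names and the numeric values are hypothetical.

```python
def data_graph(s):
    """Weights w_ij = s(x_i, x_j) for i != j and 0 on the diagonal (assumed)."""
    n = len(s)
    return [[s[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]


def conditional_probability(w):
    """p(x_j | x_i) = w_ij / sum_j w_ij; requires every row sum to be positive."""
    p = []
    for row in w:
        d_i = sum(row)  # total weight on the edges leaving x_i
        if d_i <= 0.0:
            raise ValueError("vertex without a positive-weight edge")
        p.append([v / d_i for v in row])
    return p


# Hypothetical pseudo-similarity values for three data objects.
s = [[0.33, 0.30, 0.05],
     [0.31, 0.33, 0.06],
     [0.04, 0.05, 0.33]]

w = data_graph(s)
p = conditional_probability(w)
edges = [(i, j) for i in range(3) for j in range(3) if w[i][j] > 0.0]
```

Each row of `p` sums to one, so it can be read as the transition distribution of a random walk over the data graph.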


3.4  A  Graph-Based  Formalization 

We shall show that, for hard assignment, Problem 1 can be approximated as the
following problem over the data graph when data is uniformly distributed.

Problem 2. When the number of clusters $k$ is specified, find $k$ disjoint
subsets $\{E_1, \ldots, E_k\}$ of edges in the data graph $G$ which minimize

$$\sum_{l=1}^{k} \sum_{w_{ij} \in E_l} w_{ij} \tag{12}$$

such that the removal of the edges from $G$ results in $k$ disconnected
components.


Objective functions. Note that when random variables $X$ and $Y$ are specified,
$I(X;Y)$ is some constant value. Based on this fact, Problem 1 can be
transformed into the following equivalent problem for any fixed $\beta$ (see
[13,8]).

Problem 3. Find the conditional probability distribution $p(t|x)$ which
minimizes the following objective function

$$\sum_{x} \sum_{t} p(x) p(t|x) (-\log Z(x,\beta)) \tag{13}$$

In the data graph $G$, the objective function is represented as

$$F_G = \sum_{x_i} \sum_{x_j} p(x_i) p(x_j|x_i) (-\log Z(x_i,\beta)) \tag{14}$$

We define the sum of weights on the edges from $x_i$ as $d_i$⁴.

$$d_i = \sum_{j} w_{ij}, \quad \forall x_i \in \mathcal{X} \tag{15}$$

We introduce one assumption to show our result.

Assumption 1. Data is uniformly distributed and $p(x)$ is some constant $c > 0$.
Hereafter, Assumption 1 is referred to as uniform distribution.

Proposition 3. Under uniform distribution, $F_G$ is some constant for
$\mathcal{X}$.

Proof.

$$\begin{aligned}
F_G &= \sum_{x_i} \sum_{x_j} p(x_i) p(x_j|x_i) (-\log Z(x_i,\beta)) \\
&= c \sum_{x_i} (-\log Z(x_i,\beta)) \sum_{x_j} p(x_j|x_i) && (16) \\
&= c \sum_{x_i} (-\log d_i) \sum_{x_j} \frac{w_{ij}}{d_i} && (17) \\
&= c \sum_{x_i} (-\log d_i) && (18)
\end{aligned}$$

Since $p(x_i) = c$ and $Z(x_i,\beta) = \sum_{x_j} p(x_j) \exp(-\beta
D_{KL}[p(y|x_i) \,\|\, p(y|x_j)]) = \sum_{j} w_{ij} = d_i$, and $d_i$ is some
constant for each $x_i$, eq.(16) follows. Eq.(17) follows from eq.(11), and
$\sum_{x_j} w_{ij}/d_i = 1$ for each data object $x_i$ induces eq.(18). Since
each $-\log d_i$ is some constant as in eq.(16), Proposition 3 holds. □


4 $x_j$ ranges over $\mathcal{X}$ and corresponds to

Compression and cut. Let us consider a 2-way partition of $\mathcal{X}$ into
two mutually exclusive and exhaustive sets, i.e., $\mathcal{X} = S \cup
\bar{S}$⁵ [14]. $S$ and $\bar{S}$ correspond to two clusters of objects. By
removing or cutting the edges between $S$ and $\bar{S}$, the data graph $G$ is
partitioned into two induced subgraphs $G_S$ and $G_{\bar{S}}$ [4], and becomes
a (disconnected) graph $\hat{G} = \{G_S, G_{\bar{S}}\}$.

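Definition 5. We define the following to characterize a partition.

$$\mathrm{cut}(S, \bar{S}) = \sum_{x_i \in S} \sum_{x_j \in \bar{S}} w_{ij}, \qquad \mathrm{cut}(\bar{S}, S) = \sum_{x_i \in \bar{S}} \sum_{x_j \in S} w_{ij} \tag{19}$$

As in eq.(11), for any partition of the data graph $G$ where each induced
subgraph $G_S$ has $|S| > 1$, $w_{ij} / \sum_{x_j \in S} w_{ij}$ is a valid
conditional probability distribution over $G_S$.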

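For each $x_i \in \mathcal{X}$, let us denote the subset of $\mathcal{X}$ which
contains $x_i$ as $S_i$, and the other subset as $\bar{S}_i$. As in eq.(15), we
define the following.

$$d_{S_i} = \sum_{x_j \in S_i} w_{ij}, \qquad d_{\bar{S}_i} = \sum_{x_j \in \bar{S}_i} w_{ij} \tag{20}$$

The following relation holds between eqs.(15) and (20) for any $x_i$ in $G$.

$$d_i = d_{S_i} + d_{\bar{S}_i} \tag{21}$$

We define the conditional probability distribution over $\hat{G} = \{G_S,
G_{\bar{S}}\}$ as:

$$\forall x_i \in S, \quad p(x_j|x_i) = \begin{cases} \dfrac{w_{ij}}{d_{S_i}} & x_j \in S \\ 0 & \text{otherwise} \end{cases} \tag{22}$$

$$\forall x_i \in \bar{S}, \quad p(x_j|x_i) = \begin{cases} \dfrac{w_{ij}}{d_{S_i}} & x_j \in \bar{S} \\ 0 & \text{otherwise} \end{cases} \tag{23}$$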

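5 $\bar{S}$ is the complement of $S$.

$F_{G_S}$ and $F_{G_{\bar{S}}}$ are defined as in eq.(14), and can be rewritten
under uniform distribution as:

$$F_{G_S} = c \sum_{x_i \in S} (-\log d_{S_i}) \tag{24}$$

$$F_{G_{\bar{S}}} = c \sum_{x_j \in \bar{S}} (-\log d_{S_j}) \tag{25}$$

$$F_{\hat{G}} = F_{G_S} + F_{G_{\bar{S}}} \tag{26}$$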

Note that $p(x_j|x_i)$ defined in eqs.(22) and (23) does not satisfy eq.(5),
since $p(S_i|x_i) = 1$ and $p(\bar{S}_i|x_i) = 0$ for all $x_i \in
\mathcal{X}$⁶, and deviates from eq.(5) due to the hard assignment of each
$x_i$ into $S_i$⁷. We would like to minimize the deviation to solve Problem 1.
Since $F_G$ is constant by Proposition 3, minimization of the deviation
$F_{\hat{G}} - F_G$ is equivalent to the following problem.

Problem 4. For any set of objects $\mathcal{X}$, find the 2-way partition
$\mathcal{X} = S \cup \bar{S}$ of the data graph $G$ which minimizes
$F_{\hat{G}} = F_{G_S} + F_{G_{\bar{S}}}$ in $\hat{G} = \{G_S, G_{\bar{S}}\}$.

Main result. We show the correspondence between Problem 1 and Problem 2.
First, we define the following problem.

Problem 5. For the data graph $G$, find two disjoint subsets $\{E_1, E_2\}$ of
edges which minimize

$$\sum_{l=1}^{2} \sum_{w_{ij} \in E_l} w_{ij} \tag{27}$$

such that the removal of the edges from $G$ results in a disconnected graph
$\hat{G} = \{G_S, G_{\bar{S}}\}$, where $G_S$ and $G_{\bar{S}}$ are components
of $\hat{G}$.

Claim. In hard assignment, Problem 1 can be approximated as Problem 5 under
uniform distribution.

Proof. As explained, Problem 1 can be reduced to Problem 4. Thus, we show
the correspondence between Problem 4 and Problem 5. In the following, symbol
$\Leftrightarrow$ represents the equivalence, and symbol $\simeq$ represents
the approximation.

$$\begin{aligned}
\min F_{\hat{G}} &\Leftrightarrow \min \Big\{ \sum_{x_i \in S} (-\log d_{S_i}) + \sum_{x_j \in \bar{S}} (-\log d_{S_j}) \Big\} \\
&\simeq \min \Big\{ \sum_{x_i \in S} d_{\bar{S}_i} + \sum_{x_j \in \bar{S}} d_{\bar{S}_j} + \sum_{x_i \in S \cup \bar{S}} (1 - d_i) \Big\} && (28) \\
&\Leftrightarrow \min \Big\{ \sum_{x_i \in S} d_{\bar{S}_i} + \sum_{x_j \in \bar{S}} d_{\bar{S}_j} \Big\} && (29) \\
&\Leftrightarrow \min \{ \mathrm{cut}(S, \bar{S}) + \mathrm{cut}(\bar{S}, S) \} && (30) \\
&\Leftrightarrow \min \sum_{l=1}^{2} \sum_{w_{ij} \in E_l} w_{ij} && (31)
\end{aligned}$$

The first equivalence holds under uniform distribution from eq.(26). Based on
eq.(21), the Taylor expansion of the log function, $(-\log d_{S_i}) \simeq
d_{\bar{S}_i} + (1 - d_i)$, shows that eq.(28) holds. As in Proposition 3,
since each $d_i$ is some constant in $G$, it is equivalent to eq.(29). From
eq.(20) and the definition of $\mathrm{cut}(S, \bar{S})$, eq.(29) is equivalent
to eq.(30), and the latter is equivalent to eq.(31). □

6 $S$ and $\bar{S}$ correspond to clusters.

7 Any hard assignment deviates from eq.(5).

The above Claim can be easily extended to more than two clusters.

Claim. In hard assignment, Problem 1 can be approximated as Problem 2 under
uniform distribution.
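
To make the combinatorial problem concrete, the sketch below solves the two-cluster case by exhaustive search, minimizing $\mathrm{cut}(S,\bar{S}) + \mathrm{cut}(\bar{S},S)$ over all 2-way partitions of a tiny graph. This brute-force search is exponential in the number of objects and is only an illustration under assumed toy weights; it is not the algorithm used in the experiments. All names and values are hypothetical.

```python
from itertools import combinations


def cut(w, a, b):
    """cut(A, B): sum of w_ij over x_i in A and x_j in B."""
    return sum(w[i][j] for i in a for j in b)


def best_two_way_partition(w):
    """Exhaustively search the 2-way partition minimising cut(S, S') + cut(S', S)."""
    n = len(w)
    best = None
    for size in range(1, n // 2 + 1):
        for s in combinations(range(n), size):
            s_bar = tuple(i for i in range(n) if i not in s)
            value = cut(w, s, s_bar) + cut(w, s_bar, s)
            if best is None or value < best[0]:
                best = (value, s, s_bar)
    return best


# Hypothetical weights: objects 0-1 and 2-3 are strongly related within each pair.
w = [[0.0, 0.9, 0.1, 0.1],
     [0.8, 0.0, 0.1, 0.2],
     [0.1, 0.1, 0.0, 0.7],
     [0.2, 0.1, 0.9, 0.0]]

value, s, s_bar = best_two_way_partition(w)
```

For these weights the search separates objects {0, 1} from {2, 3}, since only the weak edges between the two pairs need to be removed.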

3.5  Clustering  Based  on  Data  Graph 

Section 3.4 shows that the clustering problem in Section 2 can be tackled by
solving the combinatorial problem (Problem 2) over the proposed data graph.
Various graph algorithms have been proposed for solving this kind of problem
efficiently [11] and can be utilized via the proposed reduction of the problem.

However, it is known that small unbalanced clusters tend to be created under
the minimum cut formulation of partitioning [14]. From the viewpoint of data
clustering, unbalanced clusters are not desirable. Thus, when solving Problem 2
over the data graph, in addition to minimizing the objective function, it is
important to consider the balance between the clusters.

4  Evaluations 

4.1  Application  for  Document  Clustering 

Although the proposed method is generic and not specific to document
clustering, following the previous work [8,2], we evaluated the proposed
approach on the document clustering problem. Similar to the example in
Section 2.2, for a given set of documents $\mathcal{X}$, the set of terms which
are utilized to describe the documents corresponds to $\mathcal{Y} = \{y_1,
\ldots, y_m\}$, and $p(x,y)$ corresponds to the joint probability of a document
$x$ and a term $y$. Since the number of terms is huge in general, the document
clustering problem corresponds to the clustering of high-dimensional sparse
data. Since the proposed approach is a partitioning-based method, we assume
that the number of clusters $k$ is specified.

Based on the procedure in [8,2], we evaluated the proposed approach over the
20 Newsgroup data (20NG)⁸, which has been utilized as a standard benchmark
in the document processing community. Three sets of groups were created, as
shown in Table 1. As in [8,2], 50 documents were sampled from each group in
order to create one dataset. We repeated this process and created 10 datasets
for each set of groups. For each dataset, we conducted stemming using Porter
stemmer⁹ and Monty Tagger¹⁰, removed stop words, and selected 2,000 words with
large mutual information [1].

Table 1. Datasets from 20 Newsgroup dataset

dataset   included groups
Multi5    comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space,
          talk.politics.mideast
Multi10   alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos,
          rec.sport.hockey, sci.crypt, sci.med, sci.electronics, sci.space,
          talk.politics.guns
Multi15   alt.atheism, comp.graphics, comp.sys.mac.hardware, misc.forsale,
          rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey,
          sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns,
          talk.politics.mideast, talk.politics.misc


8 http://people.csail.mit.edu/~jrennie/20Newsgroups/. 20news-18828 was utilized.

9 http://www.tartarus.org/~martin/PorterStemmer


4.2  Experimental  Settings 


Compared Methods. For each dataset, we constructed the data graph in
Section 3.3 and conducted clustering by solving Problem 2 over the graph. As
described in Section 3.5, it is important to consider the balance among
clusters. We utilized spectral clustering to fulfill this objective [14]. For
each pair $(x_i, x_j)$, the edges with $w_{ij}$ and $w_{ji}$ are defined in the
data graph. However, these should be removed simultaneously for partitioning.
Thus, we used the symmetric matrix with entries $(w_{ij} + w_{ji})/2$ in the
following experiment.

Two representative normalized graph Laplacians have been proposed and utilized
based on the diagonal matrix $\mathbf{D}$, which is filled with $d_i$ in
eq.(15) [14]: $\mathbf{L}_{rw} = \mathbf{I} - \mathbf{D}^{-1}\mathbf{W}$ and
$\mathbf{L}_{sym} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$.
We utilized both of them and constructed $\mathbf{H}_{rw}$ and
$\mathbf{H}_{sym}$. Clustering was conducted on these representations using
spherical kmeans (skmeans).
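
The symmetrization step and the two normalized Laplacians can be sketched in plain Python as follows. The sketch stops at the Laplacian matrices; the eigenvector computation and the skmeans step are omitted and would normally rely on a numerical linear algebra library. The function names and the toy weights are hypothetical.

```python
import math


def symmetrize(w):
    """Replace w_ij by (w_ij + w_ji) / 2 so that both directions are cut together."""
    n = len(w)
    return [[(w[i][j] + w[j][i]) / 2.0 for j in range(n)] for i in range(n)]


def normalized_laplacians(w):
    """Return (L_rw, L_sym) for a symmetric weight matrix with positive degrees."""
    n = len(w)
    d = [sum(row) for row in w]  # degrees d_i, the diagonal of D
    l_rw = [[(1.0 if i == j else 0.0) - w[i][j] / d[i]
             for j in range(n)] for i in range(n)]
    l_sym = [[(1.0 if i == j else 0.0) - w[i][j] / math.sqrt(d[i] * d[j])
              for j in range(n)] for i in range(n)]
    return l_rw, l_sym


# Hypothetical asymmetric data-graph weights for three objects.
w = [[0.0, 0.30, 0.05],
     [0.31, 0.0, 0.06],
     [0.04, 0.05, 0.0]]

w_sym = symmetrize(w)
l_rw, l_sym = normalized_laplacians(w_sym)
```

Each row of `l_rw` sums to zero because $\mathbf{D}^{-1}\mathbf{W}$ is row-stochastic, which reflects the random-walk interpretation of the weights, whereas `l_sym` is a symmetric matrix.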

We compared the proposed approach with iIB and sIB in [13,9], and with
skmeans [3]¹¹. iIB tries to find the stationary distribution in eq.(5) via
projection, and sIB conducts sequential re-assignment of data into clusters.
The joint probability $p(x,y)$ was estimated from each dataset using the Ristad
method [7].


Evaluation Measure. For each dataset, cluster assignment was evaluated
w.r.t. the following Normalized Mutual Information (NMI) [12]. Let $T$ and
$\hat{T}$ stand for the random variables over the true and assigned clusters.
NMI is defined as

$$NMI = \frac{I(\hat{T};T)}{(H(\hat{T}) + H(T))/2} \quad (\in [0,1]) \tag{32}$$

where $H(T)$ is Shannon Entropy. The larger NMI is, the better the result is.
Although we have also evaluated purity [5], the results are omitted due to the
page limit.
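
Eq.(32) can be computed from two label lists as in the following sketch. The function names and the toy label lists are hypothetical, natural logarithms are used, and the value returned when both partitions consist of a single cluster is a convention assumed here rather than something stated in the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def label_mutual_information(a, b):
    """Mutual information estimated from two paired label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(true_labels, assigned_labels):
    """NMI = I(T_hat; T) / ((H(T_hat) + H(T)) / 2)."""
    denom = (entropy(true_labels) + entropy(assigned_labels)) / 2.0
    if denom == 0.0:
        return 1.0  # assumed convention: both partitions are a single cluster
    return label_mutual_information(true_labels, assigned_labels) / denom


truth = [0, 0, 0, 1, 1, 1]
same_partition = ["b", "b", "b", "a", "a", "a"]  # identical clusters, renamed
unrelated = [0, 1, 2, 0, 1, 2]

score_same = nmi(truth, same_partition)
score_unrelated = nmi(truth, unrelated)
```

NMI does not depend on how clusters are named, so the renamed but identical partition attains the maximum value, while the unrelated assignment scores zero.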


Parameters. $\beta$ in eq.(7) is the control parameter in the problem setting
in Section 2. sIB makes itself irrelevant to this parameter by setting it to a
very large value ($\beta = 10^{4}$) [9,8]; however, both iIB and the proposed
approach are affected by its value. Thus, we conducted preliminary experiments
and set the value as $\beta \in [1, 100]$ for iIB and as $\beta \in [10^{-2},
1]$ for the proposed approach in the following experiments.

The number of eigenvectors $l$ also affects the performance in spectral
clustering. Basically it was set as $l = k$ (the number of clusters); however,
for Multi5 it was set to 10 since setting $l$ to 5 was considered too low.


10 http://web.media.mit.edu/~hugo/montytagger

11 Since skmeans is the standard clustering algorithm for high-dimensional sparse data,
it was used as a baseline method.


4.3  Results 

We conducted experiments on 30 datasets, 10 for each set of groups. For each
dataset we conducted 10 runs of experiment in order to account for the
influence of initial configuration in clustering, and calculated their average.
However, for sIB, following the procedure in [9,8], for each dataset the best
result in 10 runs was utilized to calculate the average. The results are shown
in Fig. 1. In Fig. 1, kl-rw (red line) stands for the proposed approach with
$\mathbf{L}_{rw}$ and kl-sym (blue line) for the proposed one with
$\mathbf{L}_{sym}$. The compared methods are: sIB (green line), iIB (water blue
line), and skmeans (black dotted line). Since $\beta$ is the main parameter,
the horizontal axis corresponds to $\beta$, and the vertical one to NMI.

Fig. 1. Result on 20NG (w.r.t. NMI)

As for NMI, which corresponds to the correctness of data assignment for
clusters, the proposed method with $\mathbf{L}_{rw}$ (kl-rw) outperformed other
methods with respect to Multi10 and Multi15. On the other hand, for Multi5, it
outperformed iIB and skmeans, but it was below sIB.

We also compared the proposed approach with the standard spectral clustering
[14] using cosine similarity, which is widely utilized in document analysis as
a standard similarity measure. Results are summarized in Table 2. Table 2 shows
that the proposed approach clearly outperforms the standard spectral
clustering. Thus, this validates the effectiveness of the proposed graph model
in Section 3.

Table 2. Comparison with spectral clustering (NMI)

dataset   L_rw    L_sym   proposal + L_rw (β = 10^-2)
Multi5    0.573   0.641   0.627
Multi10   0.534   0.497   0.720
Multi15   0.464   0.124   0.741

As for the influence of $\beta$, the proposed approach (both kl-rw and kl-sym)
is stable for different values of $\beta$ and thus can be considered as robust
to this parameter. In iIB, the performance varied from the value of 1 to 20,
but after that it became rather stable with the value of $\beta$.


4.4  Discussion 

With respect to finding the stationary distribution in Theorem 1, the proposed
approach corresponds to iIB. Since the proposed approach outperformed iIB in
all the datasets in Fig. 1, the results confirmed the validity and the
effectiveness of the proposed approach.

The proposed approach formalizes Problem 1 as the corresponding combinatorial
problem based on the induced conditional probability over the data graph.
$\mathbf{L}_{rw}$ conducts the normalization of graph Laplacian based on the
random walk over the graph, which is induced from the weights of the graph
[14]. Thus, although both $\mathbf{L}_{rw}$ and $\mathbf{L}_{sym}$ are widely
utilized, the former seems to match the proposed approach in terms of the
conditional probability interpretation. Furthermore, the results in Fig. 1 also
suggest that $\mathbf{L}_{rw}$ is more suitable for the proposed data graph.
Thus, the proposed approach can be considered as a valid model for data
clustering based on mutual information in Section 2.

Although the proposed method is generic and not specific to document
clustering, based on the previous work [8,2], we evaluated the proposed
approach on the document clustering problem. Since the proposed method (kl-rw)
outperformed sIB for both Multi10 and Multi15, these results showed its
effectiveness for the situation where the number of clusters is large. However,
although it outperformed the standard spectral clustering, it was below sIB for
Multi5. One of the reasons is that the original Problem 1 in Section 2 is
formalized based on KL divergence, but this divergence can be rather
numerically unstable when the zero frequency problem in document processing
occurs. Coping with this problem is left for future work.

5 Concluding Remarks

We proposed a graph model for clustering based on mutual information. Based on
the stationary distribution induced from the problem setting, a
pseudo-similarity function was proposed and utilized to formalize the
clustering problem over the proposed graph model. We have shown that, in hard
assignment, the clustering problem can be approximated as a combinatorial
problem over the proposed graph model when data is uniformly distributed. We
demonstrated the effectiveness of the proposed approach by utilizing spectral
clustering and evaluating it on the document clustering problem. The results
are encouraging and indicate the effectiveness of our approach. We plan to
pursue this line of research to overcome the problem related to the instability
of KL divergence.

Abstract. Recently, the online auction has become a popular Internet service.
Since the service has been expanded rapidly, security risks in the system
remain. Fundamental measures are still required. This paper proposes a method
for detecting shill bidders in online auctions. It first detects outliers with
a one-class SVM. It then transforms the results into a decision tree using
C4.5. The experiment results demonstrate that we can use the resulting rules to
classify shill bidders.

Keywords: Online auction. Shill bidders. One-class SVM. Decision tree.


1  Introduction 

An online auction is a service that enables ordinary people to sell their items
to those who will pay the most for them. Auction sites are set up so that
people who wish to sell their items can display them to potential buyers (i.e.,
bidders). The bidding system enables competition between buyers. The buyer who
offers the highest price can acquire the item [1]. Here, both sellers and
buyers are ordinary people. They are not professional participants in these
auctions.

Recently, online auction services have expanded so rapidly that various
security risks in the system have been revealed. Fundamental measures are
required against unfair practices. Note that both sides can engage in unfair
practices. Both sellers and buyers may suffer from unfair practices. For
example, unfair sellers may try to steal money from buyers without sending the
purchased products. On the other side, unfair buyers may try to steal items
without paying the money. Other types of unfair practices are also observed.
Although the bidding systems have to provide a means to protect both types of
users (sellers and buyers) from such unfair practices, they always end up
reacting after a new type of unfair practice emerges.

Among various unfair practices from the buyer’s side, the issue of “shill
bidders” [2] remains unsolved. This paper proposes a method to detect shill
bidders in an online auction. A characteristic of the proposed method is its
semi-automatic function for finding a new type of shill bidder. It first finds
outliers based on buyer behavior. It then analyzes the outliers to detect shill
bidder behavior. A one-class SVM and the decision tree learning algorithm C4.5
are used to find a new type of shill bidder. We demonstrate that a simple
combination of these standard learning methods is effective in coping with
newly devised unfair practices. Abnormal behavior associated with unfair
practices can be detected in the form of outliers. Rules generated by the
decision-tree learning method can classify these outliers and can discriminate
unfair practice from innocent outliers. Since buyers in an online auction are
ordinary people, innocent outliers do exist. Thus, the one-class SVM alone
cannot solve the problem.

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 351-358, 2010.

This paper is organized as follows. Section 2 briefly surveys unfair practices
in online auctions and related work on efforts to cope with them. Section 3
describes our approach. Section 4 then reports the experiment results. Finally,
Section 5 summarizes our findings.


2  Online  Auctions  and  Related  Issues 

“Shill bidders” try to get unfair profits by cheating innocent buyers. They try
to pull up the price of their items in unfair ways. A typical trick that they
use is to employ forged bidders. When a shill bidder goes to sell his item at
auction, he begins by putting his item up for auction. He also prepares forged
buyers. Typically, forged buyers are actually the shill bidder himself. He uses
multiple IDs as buyers. When an innocent buyer places a bid on the item, the
forged buyers inflate the price by bidding a higher price. After the price goes
up, the forged buyers stop bidding, and the cheated innocent buyer has to pay
the inflated price.

The automatic bidding support system of online auctions makes the situation
worse (Fig. 1). It was originally designed to help innocent buyers. The
function of the automatic bidding system is to make successful bids for the
items that the buyer wishes to buy. Within a certain price range set by the
buyer, the system automatically places bids, inflating the price little by
little. A shill bidder can also use this system to create forged buyers. Hence,
the use of forged buyers in an online auction is easier than in a traditional
auction where a real person has to participate.

To address this problem, a variety of research studies have been conducted.
Yokoo et al. point out that this problem is enabled by free mail accounts [3].
A shill bidder can use multiple free mail accounts to imitate the participation
of multiple buyers. They also propose an auction protocol that can prevent the
participation of forged buyers. Matsuo et al. [4] discuss another auction
protocol that can also prevent shill bidders in combination auctions.

The research of Yokoo and Matsuo endeavors to prevent forged buyers, i.e.,
shill bidders, using the mechanisms of the auction site. This paper seeks to
reduce the risk from shill bidders by semi-automatically identifying them. In
other words, this paper complements the studies mentioned above.

Deborah sought to predict the closing price for a given auction using the Grey
System Theory [5]. Since the number of transactions in an online auction
continually increases, the process of monitoring multiple auctions becomes
difficult for ordinary buyers. Making the right bid becomes a challenging task
for an ordinary bidder. Hence, knowing the closing price of a given auction is
an advantage. This information is useful and can be used to ensure a win in a
given auction. Our research can provide additional information on the existence
of shill bidders that will improve their prediction accuracy.

Fig. 1. Automatic bidding and human bidding


3  Semi-automatic  Identification  of  Shill  Bidders 

In  this  section  we  first  explain  the  motivation  behind  the  proposed  method  and  then 
explain  the  proposed  method  in  detail. 

3.1 Identifying Outliers as Shill Bidders

It is assumed that shill bidders change their tricks every day. Thus, any
signature-based method that relies on prior knowledge obtained using a
supervised learning method has a problem, since such a method requires manual
labeling. A method that can distinguish new tricks automatically is required.
To automate the detection of new tricks, we use an unsupervised learning
method, namely a one-class SVM. The idea behind this is that shill bidders are
outliers and their behavior differs from that of ordinary bidders.

After the one-class SVM distinguishes outliers, the outliers are further analyzed using
the decision-tree learning method C4.5. Since bidders in an online auction are ordinary
people, innocent outliers always exist. To differentiate innocent outliers from
shill bidders, we use a manual process to check the results. Data classified into each
edge node of the obtained decision tree is examined manually.

We end up modifying the class label of some nodes to “shill bidder” while modifying
the class label of other nodes to “innocent outlier.” The modified decision tree
will be used as the final decision tree for finding shill bidders. Although using C4.5
requires a manual process, the preceding one-class SVM can issue warnings concerning
new tricks.

3.2  Details  on  Finding  Shill  Bidders 

Figure 2 indicates the dataflow inside the auction system and an outline of the proposed
method.

Fig. 2. Proposed method


First, information about the items currently under auction is collected from the auction
site (1). The bidding histories of the bidders for the items are then acquired (2).
The proposed method also collects the ratings of the bidders identified in the bidding
history as well as the bidding history (5) for items that the bidder tried to purchase in
earlier auctions. Here, a similarity between bidding histories is the main source of
information for confirming the ratings of the bidders.

Table 1. Bidder Attributes

No.  Attribute
1    User ID of bidder
2    Number of successful bids placed by the bidder during the past three months
3    Number of times the bidder was rated as “bad” in past auctions
4    Number of times the bidder was ranked by other participants
5    Number of participants who ranked the bidder
6    Ratio of the most frequent party for the bidder
7    Ratio of the second most frequent party for the bidder
8    Total number of bids made by the bidder during the past three months
9    Average increase in bids
10   Average of bidding duration from the preceding bids
11   Rate at which the amount of an additional bid exceeded 100%
12   Rate at which the amount of an additional bid was less than 100%
13   Rate at which the amount of an additional bid was exactly 100%

Table 1 lists the attributes acquired through the above process. Attributes 2 through
5 are extracted from the rating information, and attributes 6 through 13 are extracted
from bidding histories. Based on these attributes, the one-class SVM extracts outliers.
Since these attributes represent bidding histories, i.e., bidders' behavior, the one-class
SVM can find bidders whose behavior is abnormal. We use C4.5 to extract human-readable
rules for such abnormal bidders. Since abnormal buyers are not always shill
bidders, we manually check each rule found to verify that it can be interpreted as
identifying shill bidders.
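
As an illustration of attributes 6 and 7, the following sketch computes the share of a bidder's past transactions that involve the most frequent and the second most frequent party. It assumes that "party" means the counterpart (for example, the seller) in each transaction; the function name `party_ratios` and the input format are hypothetical and are not taken from the paper.

```python
from collections import Counter


def party_ratios(parties):
    """Return (ratio of most frequent party, ratio of second most frequent party).

    `parties` holds one counterpart ID per past transaction. This input
    format is an assumption; the paper does not specify one.
    """
    if not parties:
        return 0.0, 0.0
    counts = [c for _, c in Counter(parties).most_common(2)]
    total = len(parties)
    first = counts[0] / total
    second = counts[1] / total if len(counts) > 1 else 0.0
    return first, second


# A bidder whose transactions concentrate on two counterparts.
ratios = party_ratios(["s1", "s1", "s1", "s2", "s2", "s3"])
print(ratios)
```

A bidder with only one or two counterparts yields high values for both ratios, which is the pattern the later analysis associates with forged buyers.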


4  Experiment  Results 

For the experiments, information on 59,949 users and 67,244 items was collected
from an auction site [6]. We selected this site to test the proposed method. This site
has a basic mechanism for protecting users from shill bidders. For example, bidders
have to register their credit card numbers. Credit card information improves the traceability
of the transaction. This simple registration process reduces unfair practices. To
exclude noise from non-active users, we only analyzed the behavior of users who had
participated in auctions more than five times.

4.1  Generated  Rule  and  Classified  Buyers 

Figure 3 presents the decision tree generated by C4.5. The outliers found by the one-class
SVM are classified using the tree in Fig. 3. When the ratio of outliers is
set to less than 1%, we can define the final tree as the tree that separates shill bidders
from other innocent buyers. The tree in Fig. 3 is generated with an outlier ratio of
0.5%.

In this tree, the branches ending in F nodes with bold outlines seem to classify shill
bidders. The branches ending in F nodes with dashed outlines seem to classify active
innocent buyers. Although this tree also classifies active buyers as outliers, the interpretation
of the end-nodes is not a difficult task for human analysts.

For example, the branch from the root node to the rightmost F node with a bold
outline indicates a set of conditions for discriminating shill bidders. Each node in the
branch is a condition for the discrimination. It first checks the rate: “(11) the rate at
which the amount of an additional bid exceeded 100%” (root node). If the rate is less
than or equal to 88.9%, it checks the subsequent conditions, such as “(2) the number
of successful bids placed by the bidder during the past three months” and “(10) Average
of bidding duration from the preceding bids.” From this branch, it seems that information
such as the bidding duration (10) and the ratio of the second most frequent party (7) is
important for identifying shill bidders. A short bidding duration (<= 14.3) seems to
indicate the possibility of an automated auction agent, and the ratio of the second
most frequent party (>41.7%, i.e., the buyer has only two parties) seems to indicate the
possibility of forged buyers.
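
The branch described above can be read as a simple predicate. The sketch below encodes only the three thresholds quoted in the text (88.9% for the raise rate, 14.3 for the bidding duration, 41.7% for the second most frequent party). The full tree in Fig. 3 contains further conditions that are not reproduced here, and the function and key names are hypothetical. A match marks a candidate for manual review, not a confirmed shill bidder.

```python
def matches_shill_branch(bidder):
    """Check the thresholds quoted for the rightmost bold-outlined branch.

    `bidder` is a dict with assumed (hypothetical) keys:
      raise_over_100_rate : rate at which an additional bid exceeded 100%, in percent
      avg_bid_duration    : average bidding duration, in the unit used by the paper
      second_party_ratio  : ratio of the second most frequent party, in percent
    """
    return (
        bidder["raise_over_100_rate"] <= 88.9
        and bidder["avg_bid_duration"] <= 14.3
        and bidder["second_party_ratio"] > 41.7
    )


suspect = {"raise_over_100_rate": 50.0, "avg_bid_duration": 3.0, "second_party_ratio": 48.0}
regular = {"raise_over_100_rate": 50.0, "avg_bid_duration": 120.0, "second_party_ratio": 5.0}
print(matches_shill_branch(suspect), matches_shill_branch(regular))
```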

In contrast, the branch to the leftmost F node with a dashed outline indicates the
conditions, i.e., (11), (2) and (7), for identifying active innocent buyers. Here, the ratio
of the second most frequent party (<=7.1%, i.e., the buyer has many parties) seems to
account for the activity of the buyer.

Fig. 3. The rule generated by C4.5

Since the one-class SVM also classifies active innocent buyers as outliers, the tree
automatically generated by C4.5 cannot classify shill bidders alone. However, it is
difficult to manually check all buyers in an auction without the help of such a tree.
Since the number of buyers who are classified as outliers is relatively small, the proposed
method can make the task of finding shill bidders easier. Moreover, the decision
tree also makes interpreting the outlier buyers easier. We can interpret the tree
itself to understand the nature of the outliers found.

4.2  Detailed  Analysis 

Figure 4 plots the change in the number of active buyers and shill bidders found by
the proposed method. The horizontal axis represents the ratio of outliers. We change
this ratio by changing an input parameter for the one-class SVM program. After the
one-class SVM located outliers, a decision tree created by C4.5 was analyzed. The
numbers of active buyers and shill bidders classified by the tree are plotted in this figure.
As seen in the graph, the number of shill bidders does not change radically. In
contrast, the number of active buyers does change.

Even when we change the ratio of outliers, the group of buyers classified as shill
bidders remains relatively stable. Furthermore, 77.5% of buyers who were classified
as shill bidders by the proposed method actually were shill bidders.¹ Thus, we believe
that the proposed method is useful for identifying shill bidders.


¹ We manually checked all of the buyers who were classified using the proposed method.

Fig. 4. Detection ratio and breakdown


The percentage of shill bidders is less than we expected. The auction site we analyzed
requires an ID, such as a credit card number, in order to create an account. This
simple requirement seems to make the registration of forged buyers more difficult and
thus contributes to the decrease in the number of shill bidders.

Another important result of our work is a rule for finding active innocent users.
The change in the outlier rate seems to control the activity of the identified “innocent
active buyers.” We can use this result for marketing purposes.


5  Conclusion 

This paper proposes a method for detecting “shill bidders” in online auctions. It first
detects outliers using a one-class SVM. It then transforms the results into a decision tree
using C4.5. The experiment results demonstrate that we can treat the resulting rules,
after manual checking, as rules for classifying shill bidders. Therefore, the proposed
method can in fact detect shill bidders in an online auction. Specific findings of our
research are as follows.

1.  When  the  outlier  ratio  for  the  one-class  SVM  is  set  to  around  0.01,  our  method 
generates  a  decision  tree  that  can  discriminate  shill  bidders  and  active  innocent 
buyers  from  ordinary  buyers. 

2.  The  informative  attributes  for  classifying  a  shill  bidder  are  the  ratio  of  the  most 
frequent  party  for  the  bidder,  the  ratio  of  the  second  most  frequent  party  for  the 
bidder,  the  average  bidding  duration  from  the  preceding  bids,  and  the  raising 
rate  of  any  additional  bids. 

3. The most important feature of the proposed method is its ability to automatically
adapt to new shill bidder behavior. The proposed method can classify a
shill bidder exhibiting a new behavior as an outlier. The generated tree can help
analyze the shill bidder's new behavior.

We can use information about shill bidders for various purposes. For example,
the accuracy of the price expectations for future auctions can be improved with this
information. Also, a relative absence of shill bidders can be cited to favorably rate the
auction site. Managers of auction sites as well as buyers can use this information to
decrease their risk.


Abstract. Recently, automatic system management has attracted much
attention, in particular the mining of system log files for anomaly detection,
diagnosis and prediction. An important problem in this area is mining hot clusters of
similar anomalies for system management. A hot anomaly cluster is defined
as a largest-sized group of similar anomalies whose similarity satisfies
some user-specified constraints. Because some major anomalies have
common symptoms and are shared by several hot clusters, these clusters
do not have to be disjoint. This problem therefore cannot be easily solved by
existing clustering algorithms, such as k-means and EM. In this paper
we propose a novel heuristic clustering algorithm, named Hot Clustering
(HC), for mining these patterns. The key idea of HC is to group neighboring
anomalies into hot clusters based on some heuristic rules. To validate
our approach we perform an experiment on bug reports from the Bugzilla
database using k-means, EM and HC. The experimental results show that
our approach is both efficient and effective for this problem.


1  Introduction 

Nowadays, computing systems are becoming increasingly difficult to monitor, manage
and maintain. There is an urgent need for automatic and efficient approaches
to achieve that [1]. A popular approach for system management is based on analyzing
system log files that are stored in structured or unstructured text forms.
However, it is costly for system managers to deal with such a large data set.
Moreover, log files are generated by a number of different corporate systems,
thus the emphasis and wording vary considerably, i.e., anomalies that are truly
about the same problem of the system may be described in different ways by
different authors, at varying times and under varying conditions [2]. Thus the
effective discovery of hot clusters of similar anomalies for system management
constitutes our most urgent problem.

To automatically discover hot anomaly clusters, different types of anomalies
must be separated while similar anomalies must be grouped. Thus, we can use
traditional clustering algorithms, which take text reports as input and automatically
group each report into a single type. However, as the majority of anomaly
clusters are very small while only a few are large, traditional
algorithms are not effective and efficient enough to discover these large clusters
with satisfactory similarity. Moreover, since some major anomalies are common
symptoms and are shared by several hot clusters, these clusters may overlap
with each other. Therefore, this problem cannot be easily solved by existing
clustering algorithms, such as k-means and EM. It is necessary for us to explore
new methods for the detection of hot anomaly clusters.


2  Related  Works 

Recently, automatic system management has attracted much attention, in particular the mining of
system log files for anomaly detection, diagnosis and prediction [1,3,4,5]. One of
the key issues is to group similar anomalies in system log files. Tao Li et al. [1]
apply text mining techniques to categorize messages in log files into common
situations, and build an integrated framework of heterogeneous logs for system
management. Zhenmin Li et al. [3] classify bug reports into different categories
based on text classification and information retrieval techniques. Their work focuses on
investigating impacts of new factors on software errors to improve software design,
development, and so on. Mike Chen et al. [4] train decision trees to identify
causes of failures from web request logs, thus diagnosing failures in large Internet
sites. Yinglung Liang et al. [5] exploit different classifiers including RIPPER,
SVMs and a nearest neighbor-based method on event logs from IBM Blue Gene/L,
in order to predict failure events of the system.

Unlike these works, this paper focuses on mining hot clusters of similar
anomalies for system management. The importance of this problem has been receiving
a growing amount of attention. In [6] Srivastava et al. discuss four clustering
techniques used for this problem, including k-means, Sammon mapping, EM and
Spectral clustering. However, this problem cannot be easily solved by these
traditional methods. As the anomalies are not uniformly distributed, traditional
algorithms are not effective and efficient enough to discover hot anomaly clusters
with satisfactory similarity. Moreover, since some major anomalies are common
symptoms and are shared by several hot clusters, these clusters may overlap
with each other. This can hardly be achieved through traditional algorithms, which
mainly produce strictly disjoint clusters. Therefore, in this paper we propose a
novel clustering algorithm, Hot Clustering, which outputs the largest-sized hot
anomaly clusters and allows the resultant clusters not to be disjoint.

The proposed algorithm extends the classic density-based clustering method with
an adjustable similarity threshold and multi-class clustering. When no similarity
threshold is set and strictly disjoint clusters are required, our algorithm degrades
to classic density-based clustering [7,8]. A similar work [9] by Daxin Jiang et al.
proposes a density-based hierarchical clustering method to cluster gene expression
data. Their algorithm builds a density tree by summarizing clusters and
dense areas to explore the cluster structure of a data set, while our approach
groups neighboring anomalies to mine hot clusters of similar anomalies based on
some heuristic rules.


3  Problem  Specification 


This section describes the problem that we focus on, i.e., given a set of anomalies
in a report data set, find the set of hot clusters of similar anomalies. A report
data set is denoted by $D = (x_1, x_2, \ldots, x_n)$, containing $n$ anomalies from $x_1$ to
$x_n$. Each anomaly $x_i$ is represented by a feature vector of length $m$, where $x_i =
(x_{i1}, x_{i2}, \ldots, x_{im})$, $i = 1, \ldots, n$. A feature represents a keyword used to distinguish
different anomalies and group similar anomalies. Feature selection is not
only important, but also very challenging, mainly due to the inherent difficulty
of the problem as well as the large volume of the text data set [10]. The feature
value $x_{ij}$ is determined by TFIDF, which is the Term Frequency and Inverse
Document Frequency of the $j$th feature in the $i$th document report, where $x_{ij} =
TF_i(f_j) \cdot \log(n / DF(f_j))$.
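
A minimal sketch of this feature weighting follows. It assumes raw term counts for $TF_i(f_j)$, whitespace tokenization and the natural logarithm, none of which the paper specifies; the function name `tfidf_vectors` and the sample reports are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(reports, features):
    """Build one TF-IDF vector per report over the given keyword list.

    x_ij = TF_i(f_j) * log(n / DF(f_j)), where TF_i(f_j) is the raw count of
    feature f_j in report i and DF(f_j) is the number of reports containing it.
    """
    n = len(reports)
    tokenized = [Counter(r.lower().split()) for r in reports]
    df = {f: sum(1 for t in tokenized if t[f] > 0) for f in features}
    vectors = []
    for t in tokenized:
        vectors.append([
            t[f] * math.log(n / df[f]) if df[f] > 0 else 0.0
            for f in features
        ])
    return vectors


reports = ["crash on startup", "crash crash when printing", "slow startup"]
features = ["crash", "startup", "printing"]
vectors = tfidf_vectors(reports, features)
for v in vectors:
    print(v)
```

A keyword that appears in every report receives weight zero, so only keywords that separate reports contribute to the distance between anomalies.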

Similarity of any two anomalies $x_p$ and $x_q$ is measured with a distance function,
denoted by $d(x_p, x_q)$. Different distance functions can be chosen for different
applications. For instance, when using the $L_1$ distance, we have
$d(x_p, x_q) = \sum_{j=1}^{m} |x_{pj} - x_{qj}|$. For an anomaly cluster $C = (x_1, x_2, \ldots, x_k)$ with
$k$ anomalies, we define two measures, average pairwise distance and maximum
pairwise distance, to calculate the similarity of the whole cluster.

Definition 1 (Average Pairwise Distance). Average pairwise distance of an
anomaly cluster $C$ is denoted by $ad(C)$:

$$ad(C) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} d(x_i, x_j)}{k^2} \qquad (1)$$

Definition 2 (Maximum Pairwise Distance). Maximum pairwise distance
of an anomaly cluster $C$ is denoted by $md(C)$:

$$md(C) = \max d(x_i, x_j), \quad \forall x_i, \forall x_j \in C \qquad (2)$$

Average pairwise distance represents the average similarity between anomalies
within a cluster: a smaller average pairwise distance means greater average similarity
and a more satisfactory hot cluster. Maximum pairwise distance reflects the
minimum similarity among all anomalies in the cluster: a smaller maximum pairwise
distance means greater minimum similarity and a more satisfactory hot cluster.
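
The two measures can be sketched as follows. The sketch assumes that the average is taken over all $k^2$ ordered pairs (including each anomaly paired with itself, which contributes zero) and uses an L1 distance chosen for illustration; the function names are hypothetical.

```python
def l1(p, q):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def ad(cluster, dist=l1):
    """Average pairwise distance: sum over all ordered pairs divided by k^2."""
    k = len(cluster)
    return sum(dist(p, q) for p in cluster for q in cluster) / (k * k)


def md(cluster, dist=l1):
    """Maximum pairwise distance over all pairs in the cluster."""
    return max(dist(p, q) for p in cluster for q in cluster)


cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
print(ad(cluster), md(cluster))
```

For this three-point cluster the pairwise L1 distances are 1, 2 and 3, so `md` is 3 and `ad` is 12/9.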

Hot Clustering aims at finding the largest and most similar anomaly clusters.
Since the two goals are incompatible, a tradeoff approach is to find the largest-sized
group of anomalies satisfying some user-specified similarity constraints.
Definition 3 formally defines a hot anomaly cluster $H$ with three user-specified
parameters, MaxVts, MaxDts and MinPts.

Definition 3 (Hot Anomaly Cluster). A hot anomaly cluster $H$ wrt.
MaxVts, MaxDts and MinPts is a non-empty subset of $D$ satisfying the following
conditions:

1. $ad(H) \le MaxVts$

2. $md(H) \le MaxDts$

3. $\|H\| \ge MinPts$, where $\|H\|$ is the size of $H$

4. $\forall p \in (D - H)$, if $H^* = (H + p)$, then either $ad(H^*) > MaxVts$ or $md(H^*) > MaxDts$

For a hot anomaly cluster, its average pairwise distance should be no more than
MaxVts, its maximum pairwise distance should be no more than MaxDts, and its
cluster size should be no less than MinPts. The last condition is an extension
of the third condition, which guarantees that the cluster should contain as many
anomalies as possible.
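
The four conditions can be checked directly for a candidate cluster. The sketch below is a brute-force illustration under assumed conventions (L1 distance, average over all ordered pairs); the function name `is_hot_cluster` and the sample data are hypothetical.

```python
def l1(p, q):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def ad(cluster):
    """Average pairwise distance over all ordered pairs."""
    k = len(cluster)
    return sum(l1(p, q) for p in cluster for q in cluster) / (k * k)


def md(cluster):
    """Maximum pairwise distance."""
    return max(l1(p, q) for p in cluster for q in cluster)


def is_hot_cluster(h, data, max_vts, max_dts, min_pts):
    """Check the four conditions of a hot anomaly cluster.

    `h` and `data` are lists of feature vectors (tuples); `h` is a subset of `data`.
    """
    if not h or len(h) < min_pts:
        return False
    if ad(h) > max_vts or md(h) > max_dts:
        return False
    # Maximality: adding any outside anomaly must violate a distance constraint.
    for p in data:
        if p in h:
            continue
        grown = h + [p]
        if ad(grown) <= max_vts and md(grown) <= max_dts:
            return False
    return True


data = [(0.0,), (1.0,), (2.0,), (10.0,)]
full = is_hot_cluster([(0.0,), (1.0,), (2.0,)], data, max_vts=1.0, max_dts=2.0, min_pts=2)
partial = is_hot_cluster([(0.0,), (1.0,)], data, max_vts=1.0, max_dts=2.0, min_pts=2)
print(full, partial)
```

The second candidate satisfies the distance and size constraints but fails the fourth condition, because the anomaly at 2.0 could still be added without violating either threshold.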

4  A  Novel  Hot  Cluster  Discovery  Algorithm 

4.1  Heuristic  Rules 

Given the hot cluster parameters MaxVts, MaxDts and MinPts, a hot anomaly
cluster can be discovered in a two-step approach. First, choose an arbitrary
anomaly from the data set as a seed. Second, expand it repeatedly until no
more anomalies can be added. An important question is how to efficiently
expand a seed anomaly to a hot cluster. To this end, two heuristic rules are
designed based on the notions of neighborhood and hot degree respectively.

Definition 4 (Eps-neighborhood). The Eps-neighborhood of an anomaly $x_i$ is denoted
by $N_{Eps}(x_i)$:

$$N_{Eps}(x_i) = \{x_j \mid x_j \in D \wedge d(x_i, x_j) \le Eps\} \qquad (3)$$

The Eps-neighborhood of an anomaly $x_i$ contains all anomalies within Eps distance
of $x_i$. It captures the neighbors of an anomaly.

Definition 5 (Hot Degree). The hot degree of an anomaly cluster $C$ is denoted
by $hd(C)$, where $\|C\|$ is the size of $C$:

$$hd(C) = \frac{\|C\|}{ad(C)} \qquad (4)$$

The hot degree of an anomaly cluster indicates its compactness, which aims at
reconciling the two goals of HC. A cluster with a higher hot degree
contains more members or has higher average similarity. The hot degree of an
anomaly $x_i$ is defined by the hot degree of its Eps-neighborhood in the following
equation.

$$hd(x_i) = hd(N_{Eps}(x_i)). \qquad (5)$$
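
The neighborhood and hot degree notions can be sketched as below. The sketch assumes an L1 distance, an inclusive radius, and an infinite hot degree when the average pairwise distance is zero (a case the definitions leave open); all function names are hypothetical.

```python
def l1(p, q):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def ad(cluster):
    """Average pairwise distance over all ordered pairs."""
    k = len(cluster)
    return sum(l1(p, q) for p in cluster for q in cluster) / (k * k)


def eps_neighborhood(i, data, eps):
    """Indices of all anomalies within distance eps of data[i] (including i)."""
    return [j for j, x in enumerate(data) if l1(data[i], x) <= eps]


def hot_degree(cluster):
    """hd(C) = ||C|| / ad(C); infinite when ad(C) is zero (an assumption)."""
    avg = ad(cluster)
    return float("inf") if avg == 0 else len(cluster) / avg


def anomaly_hot_degree(i, data, eps):
    """hd(x_i) = hd(N_Eps(x_i))."""
    return hot_degree([data[j] for j in eps_neighborhood(i, data, eps)])


data = [(0.0,), (0.5,), (1.0,), (5.0,)]
neighbors = eps_neighborhood(1, data, eps=0.5)
degree = anomaly_hot_degree(1, data, eps=0.5)
print(neighbors, degree)
```

Here the anomaly at 0.5 has three neighbors with an average pairwise distance of 4/9, giving a hot degree of 6.75, while the isolated anomaly at 5.0 only has itself as a neighbor.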

Based  on  these  notions,  two  heuristic  rules  can  be  explored: 

Neighboring Rule. When expanding a seed anomaly, give priority to its neighbors.
In Figure 1(a) for example, $x_2$ is in the Eps-neighborhood of $x_1$, while
$x_3, x_4$ are outside the Eps-neighborhood of $x_1$. When Eps is small, the
Triangle Inequality still holds even in high-dimensional
space, i.e.:

$$d(x_2, x_1) < d(x_3, x_1) \le d(x_3, x_2) + d(x_2, x_1)$$

$$d(x_2, x_1) < d(x_4, x_1) \le d(x_4, x_2) + d(x_2, x_1)$$

If $d(x_3, x_2) < d(x_4, x_2)$, then $d(x_3, x_1) < d(x_4, x_1)$ is more
probable than $d(x_3, x_1) > d(x_4, x_1)$.

Hot Degree Rule. When choosing seed anomalies for expanding, give priority
to anomalies with higher values of hot degree. In Figure 1(b) for example,
the Eps-neighborhoods of $x_1$, $x_2$, $x_3$ are denoted by $N_1$, $N_2$, $N_3$ respectively.
$x_2$ and $x_3$ are both in $N_1$. Suppose that $hd(x_3) > hd(x_2)$, i.e.:

$$\frac{\|N_3\|}{ad(N_3)} > \frac{\|N_2\|}{ad(N_2)}$$

Suppose that the anomalies in $N_1$ are very dense and uniformly distributed,
then we have $ad(N_3 \cup N_1) \approx ad(N_3)$, $ad(N_2 \cup N_1) \approx ad(N_2)$. Then we could
have

$$\frac{\|N_1 \cup N_3\|}{ad(N_1 \cup N_3)} > \frac{\|N_1 \cup N_2\|}{ad(N_1 \cup N_2)}$$

Fig. 1. Example of heuristic rules: (a) Neighboring Rule, (b) Hot Degree Rule


4.2  The  Algorithm 

The above two heuristic rules form the foundation of the process of hot anomaly
cluster discovery. The goal of HC is to gather as many anomalies as possible
within a user-specified similarity threshold. A greedy approximation heuristic
algorithm is applied, which starts from a seed anomaly and then iteratively expands
to a hot cluster based on the two heuristic rules. Repeating this process
for all anomalies in the data set will generate all hot anomaly clusters.

With respect to the three parameters in Definition 3, MaxVts and MaxDts
are the most important, while MinPts does not remarkably affect clustering results
when set in a moderate range. Algorithm 1 also introduces two other
parameters, Eps and Pts, to control the search space. Eps is used for neighbor
finding and is set as the maximum distance of acceptable neighbors. Pts is
used for seed anomaly finding and is set as the minimum number of neighbors
for seed anomalies. All these parameters are detailed and analyzed in section 4.2.

In Algorithm 1, Eps, Pts and the Eps-neighborhoods of all anomalies are initialized
first. If an anomaly is a seed anomaly, expand it from its neighborhood;
otherwise, ignore it. When expanding a seed anomaly $x_i$ to find a hot cluster,
first add all anomalies in its neighborhood to $Q_c$, in which anomalies are ordered
by their hot degree. Second, pop the anomaly with the highest hot degree out of $Q_c$ as the
new seed. This step of seed expanding aims at finding neighbors' neighbors for
the hot cluster of $x_i$. This is achieved through a two-step approach. First, drop
those anomalies whose maximum pairwise distance from the original hot cluster
exceeds MaxDts. Second, add all remaining anomalies if the average pairwise distance
of the enriched new cluster does not exceed MaxVts. All these newly added
anomalies are stored in $Q_n$ as candidate anomalies for future expansion.
All anomalies in the hot anomaly cluster started from $x_i$ are saved in $N_c(x_i)$.
When $Q_c$ is empty but $Q_n$ is not empty, add all the candidate anomalies in
$Q_n$ to $Q_c$. The expansion continues, checking the average distance constraint, until all
possible neighborhoods are examined. As a result, the associated hot clusters of
all anomalies in the data set will be extracted.
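
The expansion procedure described above can be sketched in Python. This is an illustrative reading rather than the authors' implementation: it assumes an L1 distance, an average taken over all ordered pairs, and an infinite hot degree for zero-distance neighborhoods. It also refills the working queue from the candidate queue even after a non-seed anomaly is skipped, which is a small restructuring chosen here. All names are hypothetical.

```python
def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def ad(points):
    k = len(points)
    return sum(l1(p, q) for p in points for q in points) / (k * k)


def md(points):
    return max(l1(p, q) for p in points for q in points)


def hot_clustering(data, eps, pts, max_vts, max_dts):
    """Return {seed index: sorted member indices} for every seed anomaly."""
    n = len(data)
    nbrs = [{j for j in range(n) if l1(data[i], data[j]) <= eps} for i in range(n)]

    def hd(i):
        avg = ad([data[j] for j in nbrs[i]])
        return float("inf") if avg == 0 else len(nbrs[i]) / avg

    clusters = {}
    for i in range(n):
        if len(nbrs[i]) < pts:
            continue                      # not a seed anomaly
        cluster = set(nbrs[i])            # N_c(x_i) starts as the neighborhood
        queue = sorted(nbrs[i], key=hd)   # Q_c: highest hot degree is popped first
        pending = []                      # Q_n: candidates for later expansion
        while queue:
            a = queue.pop()
            if len(nbrs[a]) >= pts:
                # Step 1: keep only new neighbors within MaxDts of the cluster.
                cand = {j for j in nbrs[a] - cluster
                        if md([data[m] for m in cluster] + [data[j]]) <= max_dts}
                # Step 2: accept them if the average distance stays within MaxVts.
                grown = cluster | cand
                if cand and ad([data[m] for m in grown]) <= max_vts:
                    cluster = grown
                    pending.extend(cand)
            if not queue and pending:
                queue, pending = sorted(pending, key=hd), []
        clusters[i] = sorted(cluster)
    return clusters


data = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
clusters = hot_clustering(data, eps=1.0, pts=2, max_vts=1.0, max_dts=3.0)
print(clusters)
```

On this toy data the clusters grown from the first two anomalies and from the next two share the anomalies at 1.0 and 2.0, which illustrates that the resulting hot clusters need not be disjoint.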


input : Data set D, average distance threshold MaxVts, maximum pairwise distance
        threshold MaxDts
output: a set of hot clusters, for each anomaly x_i, saved in N_c(x_i)

Initialize neighborhood radius Eps, seed anomaly density Pts;
Initialize N_Eps(x_i) for all anomalies in D;
for x_i in D, i = 1 to n do
    if ||N_Eps(x_i)|| >= Pts then
        Add N_Eps(x_i) to N_c(x_i);
        Add N_Eps(x_i) to Q_c;
        Sort Q_c by anomaly hot degree;
        while Q_c is not empty do
            Pop a from Q_c;
            if ||N_Eps(a)|| < Pts then
                continue;
            end
            A_n <- N_Eps(a) - N_c(x_i);
            for x_j in A_n do
                if md(N_c(x_i) + x_j) > MaxDts then
                    Remove x_j from A_n;
                end
            end
            if ad(N_c(x_i) + A_n) <= MaxVts then
                Add A_n to N_c(x_i);
                Add A_n to Q_n;
            end
            if Q_c is empty, but Q_n is not empty then
                Add Q_n to Q_c;
                Empty Q_n;
                Sort Q_c by anomaly hot degree;
            end
        end
    end
end

Algorithm 1. Hot Clustering Algorithm, HC


4.3 Improvements and Complexity

The computational cost of HC can be decomposed into two parts.

1. The time required for neighborhood initialization.

2. The time required for hot cluster extracting.

Neighborhood initialization intuitively needs to compare the anomalies one-to-one,
with a time complexity of $O(n^2)$. This is computationally very expensive,
especially for a large data set. Thus we need to partition the data set into several
disjoint subsets and initialize the neighborhoods inside these subsets. Algorithm 2
presents the partition method, in which related anomalies are put
together to automatically separate far away anomalies into different subsets.
There are two parameters for partitioning the data set, ParVts and ParPts.
ParVts is the maximum average pairwise distance of the subsets, while ParPts
is the minimum size of the subsets. The time complexity of Algorithm 2 is $O(nk)$,
where $k$ is the number of subsets.
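
The partitioning step described above can be sketched as a greedy assignment. The sketch assumes an L1 distance and an average over all ordered pairs, and it recomputes each average from scratch for clarity, so it does not reach the stated $O(nk)$ cost; the function name `partition` and the sample data are hypothetical.

```python
def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def ad(points):
    k = len(points)
    return sum(l1(p, q) for p in points for q in points) / (k * k)


def partition(data, par_vts, par_pts):
    """Greedily place each anomaly into the subset whose average pairwise
    distance stays smallest, opening a new subset when ParVts is exceeded."""
    subsets = []
    for x in data:
        best = min(subsets, key=lambda s: ad(s + [x]), default=None)
        if best is not None and ad(best + [x]) <= par_vts:
            best.append(x)
        else:
            subsets.append([x])
    # Drop subsets smaller than ParPts.
    return [s for s in subsets if len(s) >= par_pts]


data = [(0.0,), (0.2,), (9.0,), (0.1,), (9.3,), (50.0,)]
parts = partition(data, par_vts=1.0, par_pts=2)
print(parts)
```

The isolated anomaly at 50.0 ends up alone in its own subset and is removed by the size filter, so neighborhoods only need to be initialized inside the two remaining subsets.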

Hot cluster extracting is a breadth-first search of the neighborhoods with linear
complexity, because only neighbor-reachable anomalies are checked. Thus
the most costly computation is actually the calculation of the average pairwise
distance. This computation can reuse existing results through Equation 6, where $M$
is the existing cluster and $N$ is the newly added anomaly set ($M$ and $N$ are
disjoint). In the equation, $ad(M)$ is already known, and the newly added set is
much smaller than the cluster ($\|N\| \ll \|M + N\|$). So the time required for
distance computing is approximately $O(\|M\| \cdot \|N\|)$. Additionally, the search
space of a certain anomaly is actually the sum of all neighbor-reachable
anomalies from this anomaly. So the time complexity of hot cluster extracting


input : Data set D, average pairwise distance threshold ParVts, density threshold
        ParPts
output: Disjoint subsets P* = {P_i}

for x_i in D, i = 1 to n do
    for P_j in P*, j = 1 to m do
        if ad(P_j + x_i) < ad(S_p + x_i) then
            S_p = P_j;
        end
    end
    if ad(S_p + x_i) <= ParVts then
        Add x_i to S_p;
    end
    else
        Create a new subset P_x for x_i;
        Add P_x to P*;
    end
end
for P_i in P*, i = 1 to m do
    if ||P_i|| < ParPts then
        Remove P_i from P*;
    end
end

Algorithm 2. Partitioning Data set Based on Average Pairwise Distance

is $O(m \cdot l \cdot d)$, where $m$ is the number of seeds, $l$ is the size of the sum set and $d$ is the
maximum size of the cluster.


$$ad(M + N) = \frac{ad(M) \cdot \|M\|^2 + ad(N) \cdot \|N\|^2 + 2 \cdot D(M, N)}{\|M + N\|^2} \qquad (6)$$


To sum up, the time complexity of neighborhood initialization is $O(nk)$, and the
time complexity of hot cluster extracting is $O(m \cdot l \cdot d)$. As the majority of anomaly
clusters tend to be very small, i.e., $k \ll n$ and $m \ll n$, the time required
for HC is approximately $O(l \cdot d)$, which is largely determined by the neighborhood
radius Eps. Thus, the selection of Eps is the key factor affecting the performance
of the algorithm, which will be detailed in section 4.3.
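The reuse of ad(M) when a small set N is merged into M can be illustrated with a short Python sketch. It is hedged in two ways: Definition 1 is not reproduced in this section, so the sketch assumes ad(C) is the mean distance over distinct pairs, and under that assumption the normalizers are $\|C\|(\|C\| - 1)$ rather than the $\|C\|^2$ printed in Equation 6 (the two agree closely for large clusters). The function names are hypothetical.

```python
from itertools import combinations, product


def ad(cluster, dist):
    """ASSUMED definition: mean distance over distinct pairs (0 for a singleton)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def cross_sum(m, n, dist):
    """D(M, N): sum of distances between every member of M and every member of N."""
    return sum(dist(a, b) for a, b in product(m, n))


def merged_ad(ad_m, size_m, ad_n, size_n, d_mn):
    """ad(M + N) from cached ad(M), ad(N) and the cross term D(M, N).

    Only D(M, N) needs new distance computations, which costs ||M|| * ||N||.
    """
    total = size_m + size_n
    weighted = (ad_m * size_m * (size_m - 1)
                + ad_n * size_n * (size_n - 1)
                + 2.0 * d_mn)
    return weighted / (total * (total - 1))
```

With M = [0, 1, 5], N = [9] and absolute difference as the distance, `merged_ad(ad(M, d), 3, ad(N, d), 1, cross_sum(M, N, d))` matches `ad(M + N, d)` computed from scratch.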


5  Experiments 

Experiments are performed using bug reports from the Bugzilla database [11] to
evaluate the proposed HC algorithm against the benchmark algorithms KM and EM.

5.1  Data  Sources 

We collected our data from Mozilla, a large on-line open source software project
that contains about thirty-three products, including Calendar, Camino,
Composer, Firefox, Thunderbird, Core, Directory, Toolkit, Webtools, Websites,
etc. Each product has a number of bug reports in the Mozilla Bugzilla database.
These text reports are individually recorded by tens of thousands of on-line users,
and include volumes of similar and recurring bugs. They particularly address the
problems of the number and similarity of bug clusters. Through hot clustering of
similar bugs, system managers can easily triage the anomalies, recognize system
brittleness, and gain high level evolutionary information for system development.
Although our experiments are performed on bug reports in the Bugzilla database, the
approach can be used on other text data sources where high-dimensional clustering
needs to be applied to discover hot patterns of similar topics from a huge amount of
historical data.

Each bug report in the Bugzilla database contains the following attributes:
bug ID, summary, time, status, reporter, assignee, severity, bug description,
discussion comments, test cases, attachments, and activities, etc. We use all of
these attributes except time, status, reporter, assignee, attachments and
activities, which are of little relevance for hot anomaly clustering and hard for
current automatic analysis techniques to process. Our experimental products include
addons.mozilla.org, Camino, Calendar and Bugzilla with different data set sizes and
feature sizes, as shown in Table 1.

Table 1. Experimental Data Sets

| Product      | addons.mozilla.org | Camino | Calendar | Bugzilla |
| Item Size    | 2,818              | 3,790  | 0,666    | 12,224   |
| Feature Size | 984                | 1,259  | 1,694    | 2,403    |
5.2  Evaluation  Method

One of the most important issues in hot anomaly cluster discovery is the
evaluation of clustering results. In general, there are two criteria to investigate
cluster validity [12].

1. Compactness: the members of each cluster should be as close to each other
as possible.

2. Separation: the clusters themselves should be widely spaced.

Our study addresses the problem of discovering hot clusters of similar anomalies.
Since some major anomalies are common symptoms and are shared by several
hot clusters, these clusters do not have to be disjoint. Thus we can only use
the compactness criterion to evaluate different hot clustering algorithms. The
compactness of a hot anomaly cluster can be described in terms of Cluster
Size, Average Pairwise Distance and Maximum Pairwise Distance as follows:

- Cluster Size, $\|C\|$: how many members the cluster contains; a larger size means
a more satisfactory hot cluster.

- Average Pairwise Distance, ad(C), defined in Definition 1: a smaller value
means greater average similarity and a more satisfactory hot cluster.

- Maximum Pairwise Distance, md(C), defined in Definition 2: a smaller value
means greater minimum similarity and a more satisfactory hot cluster.

To investigate the performance of different clustering algorithms, we define
two kinds of hot clusters: largest hot clusters and similar hot clusters. Suppose
that the hot cluster sets found by KM, EM and HC are denoted by K, E and H
respectively, where $K = (K_1, K_2, \ldots, K_m)$, $E = (E_1, E_2, \ldots, E_n)$ and
$H = (H_1, H_2, \ldots, H_l)$.

- Largest Hot Clusters: the group of largest hot clusters is defined by a triple
$\langle K_l, E_l, H_l \rangle$, where $K_l$ is the largest cluster in K, $E_l$ is the largest cluster
in E, and $H_l$ is the largest cluster in H.

- Similar Hot Clusters: a group of similar hot clusters is defined by a triple
$\langle K_s, E_s, H_s \rangle$, which should satisfy
$\|K_s \cap E_s \cap H_s\| \geq 70\% \cdot \min\{\|K_s\|, \|E_s\|, \|H_s\|\}$.
That is to say, the similar hot clusters should share at least 70% of the
members of the smallest cluster.
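The 70% overlap condition for a group of similar hot clusters can be checked with a few lines of Python. This is only an illustrative sketch; the function name `is_similar_group` and its `ratio` parameter are hypothetical. The bug ids in the usage example are the Camino entries listed in Table 4.

```python
def is_similar_group(k_s, e_s, h_s, ratio=0.7):
    """Return True if the three clusters share at least `ratio` of the
    members of the smallest of the three clusters."""
    k_s, e_s, h_s = set(k_s), set(e_s), set(h_s)
    shared = k_s & e_s & h_s
    smallest = min(len(k_s), len(e_s), len(h_s))
    return len(shared) >= ratio * smallest


# Camino clusters from Table 4: KM, EM and HC share bugs 187773 and 187776.
km = {187773, 187776, 188042}
em = {187773, 187776}
hc = {187773, 187776, 188041}
similar = is_similar_group(km, em, hc)
```

Here the intersection has two members and the smallest cluster (EM) also has two, so the triple qualifies as a group of similar hot clusters.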


5.3  Comparison  of  the  Three  Algorithms 

The three clustering algorithms are compared over all four products listed in
Table 1. All algorithms are implemented in Java and all tests were performed
under the same conditions. Regardless of the different distance functions used
inside the clustering algorithms, a uniform distance function is used to measure
the similarity of the resulting anomaly clusters. The distance between any two
anomalies in a result cluster is computed by

$$ d(x_p, x_q) = \sum_{j} \left| B(x_{pj}) - B(x_{qj}) \right|, $$

where the sum runs over all features $f_j$ and

$$ B(x_{ij}) = \begin{cases} 1, & \text{if feature } f_j \text{ occurs in report document } i, \\ 0, & \text{otherwise.} \end{cases} $$
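Because B is a 0/1 indicator, the distance above is simply the number of features that occur in exactly one of the two reports. The following Python sketch illustrates this, together with the maximum pairwise distance of a cluster. The function names and the toy feature sets are hypothetical and are not taken from the paper's data.

```python
from itertools import combinations


def distance(features_p, features_q):
    """Sum over features of |B(x_pj) - B(x_qj)|: the size of the symmetric
    difference between the two reports' feature sets."""
    return len(set(features_p) ^ set(features_q))


def max_pairwise_distance(cluster):
    """md(C): the largest distance between any two members (0 if fewer than two)."""
    return max((distance(a, b) for a, b in combinations(cluster, 2)), default=0)


# Toy reports, each represented by the set of features (terms) it contains.
report_a = {"crash", "mail", "window"}
report_b = {"crash", "mail", "close"}
report_c = {"crash", "mail", "window"}
```

Reports with identical feature sets, such as `report_a` and `report_c`, are at distance 0, which is why clusters of bug reports with the same summary can reach ad(C) = md(C) = 0.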
The cluster number for KM and EM is obtained by testing a series of values to
find the largest and most similar sets of anomalies. The selected values
are 1,400, 2,500, 3,500, and 3,000 for addons.mozilla.org, Camino, Calendar and
Bugzilla respectively. Additionally, noisy clusters, i.e., those whose
maximum pairwise distance exceeds 60 or whose average pairwise distance exceeds 35,
are filtered out. As for HC, we employ the same values for all products: the
neighborhood radius is 8, the maximum pairwise distance is 20, and the average
pairwise distance is 10. We eliminate the parameter MinPts by setting it to 2 for
all products.

The running time for the three clustering algorithms is shown in Figure 2.
From Figure 2, it is observed that HC is much faster than the other two clustering
algorithms. A possible reason is that the data set is very dense in some regions
but very sparse in most others. KM and EM need to group all the anomalies
into different types, while HC only needs to find hot clusters in the dense
regions.


Fig.  2.  Running  Time  of  KM,  EM  and  HC 


Table 2 presents the evaluation results of the largest hot clusters for the three
clustering algorithms KM, EM and HC. The size of the largest hot cluster in HC
is close to that in EM, while the average distance and maximum distance are
much smaller. Additionally, the size of the largest hot cluster in KM is much smaller
than in the other two algorithms, especially on large data sets. In the evaluation of
the largest hot clusters, HC outperforms KM and EM on both cluster similarity and
cluster size.

Table 3 evaluates the three clustering algorithms with two groups of example
similar hot clusters. Examples for addons.mozilla.org show that HC produces


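Table 2. Largest Hot Clusters of KM, EM and HC

| Product            | Algorithm | ||C|| | ad(C) | md(C) |
| addons.mozilla.org | KM        | 28    | 25.76 | 51    |
| addons.mozilla.org | EM        | 47    | 30.28 | 53    |
| addons.mozilla.org | HC        | 43    | 9.08  | 15    |
| Camino             | KM        | 11    | 28.22 | 56    |
| Camino             | EM        | 25    | 31.33 | 52    |
| Camino             | HC        | 64    | 9.94  | 16    |
| Calendar           | KM        | 37    | 15.29 | 51    |
| Calendar           | EM        | 62    | 13.91 | 28    |
| Calendar           | HC        | 136   |       | 20    |
| Bugzilla           | KM        | 26    | 10.84 | 24    |
| Bugzilla           | EM        | 352   | 13.73 | 44    |
| Bugzilla           | HC        | 358   | 10.0  | 19    |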
much more similar anomaly clusters than KM and EM, while maintaining enough
(or even the same number of) members in the cluster. In the examples for Camino
and Calendar, HC gathers more anomalies than KM and EM, given similar
average pairwise distance and maximum pairwise distance. In the examples for
Bugzilla, HC outperforms KM and EM on both cluster similarity and cluster size.


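Table 3. Two Groups of Example Similar Hot Clusters of KM, EM and HC

| Product            | Group | Algorithm | ||C|| | ad(C) | md(C) |
| addons.mozilla.org | I     | KM        | 8     | 26.5  | 51    |
| addons.mozilla.org | I     | EM        | 6     | ?     | 35    |
| addons.mozilla.org | I     | HC        | 4     | 0     | 0     |
| addons.mozilla.org | II    | KM        | 18    | 4.14  | 25    |
| addons.mozilla.org | II    | EM        | 17    | 4.26  | 25    |
| addons.mozilla.org | II    | HC        | 17    | 2.06  | 10    |
| Camino             | I     | KM        | 3     | 9.33  | 14    |
| Camino             | I     | EM        | 2     | 0     | 0     |
| Camino             | I     | HC        | 3     | 4     | 6     |
| Camino             | II    | KM        | 3     | 2.67  | 4     |
| Camino             | II    | EM        | 3     | 2.67  | 4     |
| Camino             | II    | HC        | 6     | 3.2   | 8     |
| Calendar           | I     | KM        | 7     | 4.67  | 9     |
| Calendar           | I     | EM        | 2     | 6     | 6     |
| Calendar           | I     | HC        | 7     | 4.67  | 9     |
| Calendar           | II    | KM        | 3     | 3.33  | 4     |
| Calendar           | II    | EM        | 4     | 15.67 | 29    |
| Calendar           | II    | HC        | ?     | 7     | 11    |
| Bugzilla           | I     | KM        | 6     | 7.67  | 13    |
| Bugzilla           | I     | EM        | 6     | 7.67  | 13    |
| Bugzilla           | I     | HC        | 5     | 5.4   | 7     |
| Bugzilla           | II    | KM        | 5     | 15    | 20    |
| Bugzilla           | II    | EM        | ?     | 10.17 | 23    |
| Bugzilla           | II    | HC        | 10    | 7.2   | 14    |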

Table 4 shows the resulting bug reports in the first group of example similar
hot clusters in Table 3. Only bug ids and bug summaries are included;
the detailed description of a bug report with bug id i can be obtained from
https://bugzilla.mozilla.org/show_bug.cgi?id=i. Bug reports found by all three
algorithms are included in the row “KM, EM, HC intersect”, while those
found by only one or two of the three algorithms are included in the row for
that particular algorithm (e.g. “HC particular”).

For the similar hot clusters in Table 4, the intersected sets of bug reports
describe the same problems with the same summaries. In the example of
addons.mozilla.org, both KM and EM find some bug reports that differ from the
problem described by the intersected set, while HC contains only those
representing the intersected problem of “Update ‘Image Zoom’ Extension”. Take
the bug report 246851, found only by KM, as an example: though it seems
much like the intersected problem, it is in fact about another problem, “Text
overlaps badly when zoomed to 200%”. In the example of Camino, EM only
finds the problem of “AAHIG - Open Dialog” described by the intersected set,
while KM and HC find Bug 188042 and 188041 respectively, whose summaries also
contain the keywords “AAHIG - Open Dialog”. An in-depth analysis of these two
bugs shows that the one found by KM describes another problem, “add
application’s name to open dialog title”, while the one found by HC describes
the same problem as the intersected one, which is about “support document
preview and multiple selection”. For the example of Calendar, HC performs as
well as KM, and EM misses five bug reports. In the example of Bugzilla, as in
the example of addons.mozilla.org, KM and EM improperly find bug 297791,
which is obviously different from the problem described by the intersected set.
These four examples illustrate that HC can find larger and more accurate
hot clusters of similar bug reports than KM and EM.
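Table 4. Resultant Bug Reports in the First Group of Similar Hot Clusters

| Product            | Algorithm            | Bug Id | Bug Summary                                                  |
| addons.mozilla.org | KM, EM, HC intersect | 251210 | Update “Image Zoom” Extension                                |
| addons.mozilla.org | KM, EM, HC intersect | 254074 | Update “Image Zoom” Extension                                |
| addons.mozilla.org | KM, EM, HC intersect | 258601 | Update “Image Zoom” Extension                                |
| addons.mozilla.org | KM, EM, HC intersect | 258662 | Update “Image Zoom” Extension                                |
| addons.mozilla.org | KM particular        | 246851 | Update is not friendly to text zoom (200%)                   |
| addons.mozilla.org | KM particular        | 219413 | Add imagezoom extension                                      |
| addons.mozilla.org | KM particular        | 285749 | The Image Zoom                                               |
| addons.mozilla.org | KM particular        | 347716 | New links under ...                                          |
| addons.mozilla.org | EM particular        | 249413 | Add imagezoom extension                                      |
| addons.mozilla.org | EM particular        | 285719 | The Image Zoom                                               |
| Camino             | KM, EM, HC intersect | 187773 | AAHIG - Open Dialog                                          |
| Camino             | KM, EM, HC intersect | 187776 | AAHIG - Open Dialog                                          |
| Camino             | KM particular        | 188042 | AAHIG - Open Dialog title as “Navigator Open”                |
| Camino             | HC particular        | 188041 | AAHIG - Open Dialog support multiple selection               |
| Calendar           | KM, EM, HC intersect | 325295 | crash if I close the mail window while checking for new mail |
| Calendar           | KM, EM, HC intersect | 341622 | crash if I close the mail window while checking for new mail |
| Calendar           | KM, HC particular    | 335899 | crash if I close the mail window while checking for new mail |
| Calendar           | KM, HC particular    | 338525 | crash if I close the mail window while checking for new mail |
| Calendar           | KM, HC particular    | 341607 | crash if I close the mail window while checking for new mail |
| Calendar           | KM, HC particular    | 348422 | crash if I close the mail window while checking for new mail |
| Calendar           | KM, HC particular    | 376313 | crash if I close the mail window while checking for new mail |
| Bugzilla           | KM, EM, HC intersect | 313122 | implement validations and database persistence functions     |
| Bugzilla           | KM, EM, HC intersect | 313123 | implement validations and database persistence functions     |
| Bugzilla           | KM, EM, HC intersect | 313125 | implement validations and database persistence functions     |
| Bugzilla           | KM, EM, HC intersect | 313126 | implement validations and database persistence functions     |
| Bugzilla           | KM, EM, HC intersect | 313129 | implement validations and database persistence functions     |
| Bugzilla           | KM, EM particular    | 297791 | All instances should have ...                                |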

6  Conclusions

In this paper, we formulate the problem of mining hot clusters of similar
anomalies for system management. We show that this problem is not easily solved
by the existing clustering algorithms. We propose a new heuristic density-based
algorithm, HC, to solve this problem. The key idea of HC is to group neighboring
anomalies into hot clusters based on some heuristic rules. The experimental
results show that our approach is robust, and more efficient and effective than
k-means and EM for this problem. We believe that the HC algorithm will greatly
help system management.
Abstract. This paper develops a model for short-term prediction of
time series based on Element Oriented Analysis (EOA). The EOA model
represents nonlinear changes in a time series as strata and uses these in
developing a predictive model. The strata features used by the EOA
model have the potential to improve its forecasting performance on
nonlinear data relative to the performance of existing methods. We
demonstrate the characteristics of the EOA model using an empirical study
of stock indices from eight major stock markets. The study provides
comparisons of the accuracy and time efficiency between ARIMA, Neural
Networks and the EOA model. Our findings indicate that the EOA
model is a promising approach for short-term time series prediction.

Keywords:  Short-term  Prediction,  Time  Series,  Element  Oriented 
Analysis. 


1  Introduction 

Short-term prediction in time series has had significant practical applications
across different domains in recent years. For example, Wild [18] contributed a
method for accurate short-term forecasting of traffic volume time series. His
approach achieved satisfactory predictions of traffic volumes at road intersections.
Darbellay and Slama [4] forecast short-term electricity demand using existing
time-series methods. Furthermore, Gorr and his colleagues [7] extended the topic
to the short-term forecasting of crimes. Their results provide a novel approach
with applications in the prevention of potential crimes.

Many methods for short-term time series prediction have been reported in
the academic literature. Those methods include Exponential Smoothing [16],
GARCH [2], ARIMA [1] and Neural Networks [8], to name a few. In terms of
practical implementations and applications in the time series domain, ARIMA
and Neural Networks are considered to be two mainstream models for short-term
prediction [5], [15].

The technical limitations of the ARIMA and Neural Network models have
been discussed in the literature [4], [6], [10], [14], [19]. A key limitation of ARIMA
models is that the assumptions of linear and stationary time series are insufficient
in many real-world applications [6]. Furthermore, it is a difficult task to build an
ARIMA model because it requires good domain knowledge in the specific area
of its application [10].

Unlike ARIMA models, Neural Networks do not have the linearity
assumption [6]. However, Neural Networks can also be difficult to configure (and
train) successfully. Darbellay and Slama [4] have concluded that Neural Networks
had no established procedure for identifying the optimal network structures. A
related issue is that Neural Networks have the potential of overfitting, leading
to inaccurate predictions [13]. Therefore, the application of Neural Networks
often involves a potentially time-consuming empirical trial-and-error approach to
obtain an accurate prediction [4].

The main idea behind our stratified predictive model is to discover and
represent the dynamical nonlinear changes in a given time series and utilize them to
assist forecasting by using Element Oriented Analysis (EOA), proposed by
Zhang et al. [21]. In this paper, we investigate whether an EOA based stratified model
improves short-term prediction performance relative to the linear ARIMA models
while requiring considerably less training effort than the application of Neural
Networks in time series analysis.

The rest of the paper is organized as follows. Section 2 introduces the basic
ideas behind the EOA model based on a simple example of time series. Then
we address the framework of the EOA model and the building of the strata for time
series prediction in Section 3. In Section 4, the relative performance of the EOA
model is demonstrated by an experimental study on the time series indices from
eight major stock exchange markets. In particular, we compare the accuracy and
time efficiency between the results of the ARIMA model, Neural Networks and
the EOA model. The last section summarises the paper's findings and discusses
future work directions.

2  Element  Oriented  Analysis 

EOA is a methodology for developing predictive models and not an algorithm.
The EOA methodology involves the design of new features or attributes based
on a segmentation of the original data. The initial idea of EOA has been proposed
and partially used to predict corporate bankruptcy by Zhang et al. [21]; we omit
a detailed explanation of the EOA model in this paper. In that application,
the data was segmented and the segment characteristics were used to add new
informative features.

In the time series application reported in this paper, the time series training
data is segmented into strata. Informative features are then extracted from
these strata and used in an AutoRegression model. The most critical aspect of
the application of EOA to time series prediction is how to choose the elements.
The following example gives the definition of two elements that we will use later.
Those elements are said to represent a latent relationship within a given dataset
in terms of certain intrinsic properties.

Example 1. Suppose that a given dataset consists of ten observations with one
binary target variable (Y) and one numeric explanatory variable (X) as follows:

Y : 0, 1, 1, 0, 1, 1, 0, 0, 1, 0
X : 7, 3, 4, 6, 3, 4, 7, 6, 4, 7

Suppose that the study objective from the analysis of this dataset is to find
the relationship explaining what kind of X is more likely to cause the case of
either Y = 0 or Y = 1. According to Definition 1, some intrinsic properties are
extracted into the new informative features. One of the simple ways to do this
is to discover the horizontal and vertical intrinsic properties within the dataset
as shown in Figure 1.


[Figure: the ten observations of Y and X from Example 1, annotated with the
Horizontal Intrinsic Property (H.I.P) and the Vertical Intrinsic Property (V.I.P)]

Fig. 1. Two Intrinsic Properties in the Dataset


In the figure, the vertical intrinsic property refers to the variance of X in terms
of the 0/1 change of Y. In addition, the horizontal intrinsic property shows the
status of each single data point within the entire X that should be discovered.
Therefore, we define two elements. Element $s_1$ represents the probability of the
explanatory variable given the occurrence of the target variable Y. We use the
following function (1) to describe the element:

$$ s_1 = P(X \mid Y = 1) \quad \text{or} \quad s_1 = P(X \mid Y = 0) \tag{1} $$

where $P(X \mid Y)$ is the conditional probability of X when Y occurs.

Another element $s_2$ is used to depict the dataset from the viewpoint of the
observations across all attributes. As a result, $s_2$ states the possibility of the
overall partition for the observation between Y = 1 and Y = 0. For example,
we might use a clustering algorithm based on a distance matrix to calculate the
belongingness possibility by the following function (2)

$$ s_2 = \frac{1}{\sum_{j=1}^{2} \frac{d_1}{d_j}} \quad \text{or} \quad s_2 = \frac{1}{\sum_{j=1}^{2} \frac{d_2}{d_j}} \tag{2} $$
where $d_1$ and $d_2$ represent the distance from the observation to the cluster for
Y = 0 and Y = 1. According to functions (1) and (2), we obtain two elements to
replace the original explanatory variables as shown below

Y   : 0, 1, 1, 0, 1, 1, 0, 0, 1, 0
s_1 : 0.6, 0, 0, 0.4, 0, 0, 0.6, 0.4, 0, 0.6
s_2 : 0.11, 0.86, 0.87, 0.2, 0.86, 0.87, 0.11, 0.2, 0.87, 0.11

In the above example, the two elements $s_1$ and $s_2$ are said to reveal the latent
structure between the target variable (Y) and the original explanatory variable
(X). Furthermore, the two elements contain the same number of observations
as the original dataset and express the original data in terms of either the view
of attributes or the whole dataset. Therefore, they are named Structure
Elements. The defined elements are chosen based on insights and knowledge of
the intended application. The two elements mentioned in the above example use
segments or strata.
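The element values in Example 1 can be reproduced with a short Python sketch. Two choices here are assumptions rather than statements from the paper: $s_1$ is taken as $P(X \mid Y = 0)$, and the "cluster" for each value of Y is represented by the mean of its X values, with the second form of function (2) (numerator $d_2$) used for $s_2$. Under these assumptions the sketch yields the values listed in the example after rounding to two decimals.

```python
Y = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
X = [7, 3, 4, 6, 3, 4, 7, 6, 4, 7]


def element_s1(xs, ys, target=0):
    """s1 = P(X = x | Y = target), estimated by relative frequency."""
    group = [x for x, y in zip(xs, ys) if y == target]
    return [group.count(x) / len(group) for x in xs]


def element_s2(xs, ys):
    """s2 = 1 / sum_j (d2 / dj), where d1 and d2 are the distances to the
    Y = 0 and Y = 1 cluster centres (ASSUMED to be the group means).

    Note: this simple form divides by d1 and d2, so it assumes that no
    observation coincides exactly with a cluster centre.
    """
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    centre0 = sum(zeros) / len(zeros)
    centre1 = sum(ones) / len(ones)
    result = []
    for x in xs:
        d1, d2 = abs(x - centre0), abs(x - centre1)
        result.append(1.0 / sum(d2 / d for d in (d1, d2)))
    return result


s1 = element_s1(X, Y)
s2 = [round(v, 2) for v in element_s2(X, Y)]
```

For instance, X = 7 lies 0.4 from the Y = 0 centre (6.6) and 3.4 from the Y = 1 centre (3.6), giving $s_2 = 1 / (3.4/0.4 + 1) \approx 0.11$.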

We next describe how the elements are used to make predictions. In general, the
Element Oriented Analysis (EOA) methodology has the following components:

1. New elements representing the informative features are generated by
segmenting the original dataset.

2. The resulting model uses the new elements (and optionally the original data).

3. The resulting model is multi-level, using a Local-Global hierarchy resulting
from the use of new Elements based on segments and original data.

Here, the term Local-Global hierarchy refers to two steps of Element Oriented
Analysis. Step 1 is the Local Level (LL), for determining the elements for each
individual application. Step 2 is the Global Level (GL), which uses the Elements
from LL to meet the modeling objective.

EOA has been applied to other applications such as modeling a classifier
predicting whether a credit card holder is good or bad. In this paper, the modeling
objective is the prediction of the next k values (for a small k) in a time series.
A more detailed explanation of how EOA works on prediction modeling is
given in the following Section 3.

One practical concern is that the same dataset could be segmented into
elements in many different ways, depending on the data domain and intended
applications of the model. Therefore, an important part of the application of the
EOA model is the discovery and design of the elements. In some cases, especially
in time series prediction problems, the difference between two successive data
points is a key part of the original data. Therefore, an element may be defined
to state this change and combined with the original explanatory variables to
predict the future values. Because this kind of element differs
from the Structural Element conceptually, we call it the Changing Element. The
Changing Element (CE) must also contain the same number of observations as
the original data. The idea of a change element was also discussed in [20] in a
hierarchical distribution method for extracting knowledge from temporal health
records.
In  the  following  section,  we  focus  on  how  to  design  an  efficient  stratified  model
for  accurate  time  series  prediction.


3  Time  Series  Prediction  by  EOA 


A time series exhibits changes through time. We apply EOA to time series
applications by the design of a Changing Element (CE) that describes the nonlinear
change from time series at the Local Level (LL). This CE is then used in the
Global Level (GL) to estimate a prediction function.

To simplify the presentation, we first assume that $\{Y_t\}$ is a time series
following a general Autoregressive process,

$$ Y_t = \frac{\varepsilon_t + \mu}{\phi(B)}, \quad t = 1, \ldots, n, \tag{3} $$

where n is the number of observations for the time series,
$\phi(B) = 1 - \phi_1 B - \ldots - \phi_p B^p$ is a polynomial in B of degree p and B is the
backshift operator. For example, $B Y_t = Y_{t-1}$. $\mu$ is a constant. In the presence
of nonlinear changes, the time series might be affected by unobserved events. To
describe the actual time series $\{X_t\}$ subject to the influence of nonlinear
changes, the following model is considered:
$$ X_t = Y_t + f(t), \quad t = 1, \ldots, n, \tag{4} $$

where $Y_t$ follows a general Autoregressive process described in function (3), and
$f(t)$ is a parametric function that represents the nonlinear change of the actual
time series $X_t$. According to function (3), we obtain a new expression of $X_t$


$$ X_t = \frac{\varepsilon_t + \mu}{\phi(B)} + f(t), \quad t = 1, \ldots, n. \tag{5} $$


Then function (4) is converted to:

$$ X_t = \frac{\varepsilon_t + F(t)}{\phi(B)}, \quad t = 1, \ldots, n, \tag{6} $$

where $F(t) = \phi(B) f(t) + \mu$. We now select F(t) as the CE of time series $X_t$.
Since $\mu$ is a constant, F(t) is specified as follows:


$$ F(t) = \alpha(B) f(t) \tag{7} $$

where $\alpha(B) = 1 + \mu - \alpha_1 B - \ldots - \alpha_s B^s$ is a polynomial in B of degree s, and
$f(t)$ is called the Changing Element Function (CEF). If the time series is assumed
to be linear, the CEF $f(t)$ follows the general Moving Average process:


$$ f(t) = \frac{\varepsilon_t}{\theta(B)}, \quad t = 1, \ldots, n, \tag{8} $$

where n is the number of observations for the original observed series,
$\theta(B) = 1 - \theta_1 B - \ldots - \theta_q B^q$ is a polynomial in B of degree q, and $\varepsilon_t$ is a
sequence of white noise random variables with zero mean and variance $\sigma^2$. If the
Fig.  2.  The  Framework  of  the  Stratified  Predictive  Model  based  on  EOA


time series has no assumption of linearity and it has a sufficient length to train,
we might consider the CEF $f(t)$ as follows:

$$ f(t) = \frac{\theta(B)}{\alpha(B)} A_t, \quad t = 1, \ldots, n, \tag{9} $$

where $\theta(B) = 1 - \theta_1 B - \ldots - \theta_r B^r$ is a polynomial in B of degree r, and A is the CE
representing the nonlinear variance of the time series. To reduce overfitting, we
use a conditional transition probability for A, which describes the probability
of states such as continuously increasing, continuously decreasing or fluctuating.
Note that the details of the conditional transition probability A are given in
section 3.1.

According to Definition 3, the EOA based model should have LL and GL in
time series prediction. Therefore, we design three steps in LL and two steps in
GL. The comprehensive framework of our stratified predictive model is shown
in Figure 2.

To better explain our EOA based stratified model, we first assume a time
series $X = \{x_1, x_2, \ldots, x_n\}$. According to function (7) and function (9), the goal
in LL is to find the CE and CEF from X, and the goal in GL is to find a Prediction
Function based on function (5). In the following, we discuss the EOA model in
more detail.


3.1  Finding  Changing  Element  Function  in  Local  Level 

As mentioned in the previous section, the CE can be chosen in many ways to
represent a change between time series points. However, the selection of an optimal
CE is beyond the scope of this paper, although we might consider it in our
future work. In this study, we take the transition probability A to be the CE,
which is detected in LL. According to the framework in Figure 2, the following
three steps are designed to find the CE and CEF.

Step 1. Obtain the Observational Sequence. We first generate a series
of change rates $cr_t$ in (10) to describe the change of the original time series
$X = \{x_1, x_2, \ldots, x_n\}$.



$$ cr_t = (x_t - x_{t-1}) / x_{t-1}, \quad (2 \leq t \leq n). \tag{10} $$

Here the change rate $cr_t$ is calculated from every two consecutive
time series data points $x_t$ and $x_{t-1}$. The t's are the same time points as in the
original series, and n is the length of the observed series. We now obtain a change
rate sequence (11) from the original time series:

$$ C = \{cr_1, cr_2, \ldots, cr_{n-1}\}. \tag{11} $$

Then another new sequence (12) is created to express the possible difference
between two consecutive change rates as follows.

$$ S = \{\delta_1 = (cr_2 - cr_1), \ldots, \delta_{n-2} = (cr_{n-1} - cr_{n-2})\} \tag{12} $$
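Step 1 amounts to two passes of differencing, which can be sketched in Python as follows. The function names and the toy price series are hypothetical and only serve to illustrate functions (10) to (12).

```python
def change_rates(series):
    """Change-rate sequence C: (x_t - x_{t-1}) / x_{t-1} for consecutive points.
    A series of length n yields n - 1 change rates."""
    return [(series[t] - series[t - 1]) / series[t - 1]
            for t in range(1, len(series))]


def rate_differences(rates):
    """Sequence S: differences between consecutive change rates.
    n - 1 change rates yield n - 2 differences."""
    return [rates[i + 1] - rates[i] for i in range(len(rates) - 1)]


prices = [100.0, 110.0, 99.0, 99.0]   # toy series, n = 4
rates = change_rates(prices)          # roughly [0.1, -0.1, 0.0]
deltas = rate_differences(rates)      # roughly [-0.2, 0.1]
```

Note that each differencing step shortens the sequence by one element, which is why C has n - 1 entries and S has n - 2.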

Step 2. Find the CE A. From the sequence (12), three states can be defined
to represent whether the change rate increases, decreases or stays the same. We
define that $S_s$ represents that the new value is the same as the prior one; $S_u$
represents that the new value is stronger than the prior one (the value
has increased); and $S_d$ represents that the new value is weaker than
the prior one (the value has decreased). Accordingly, any difference between the
change rates can be represented by these three states.

$$ \delta_i \in (S_s, S_u, S_d). \tag{13} $$

Now, we find the CE of transition probability shown in Figure 3.


Fig. 3. The Changing Element of Transition Probability


There are nine values among $S_s$, $S_u$ and $S_d$ for every two consecutive $\delta_i$ and
$\delta_{i-1}$. These nine values are denoted by $\{s_1, s_2, \ldots, s_9\}$, representing the CE A of
transition probability, where $s_k = P(\delta_i \mid \delta_{i-1})$, $(1 \leq i \leq (n - 2), 1 \leq k \leq 9)$.
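The nine transition probabilities of Step 2 can be estimated by counting consecutive state pairs, as in the Python sketch below. Estimating $P(\delta_i \mid \delta_{i-1})$ by relative frequency, the tolerance `tol` used to decide that a difference counts as "the same", and all names are assumptions made for illustration; the paper does not specify these details in this section.

```python
from collections import Counter

STATES = ("same", "up", "down")   # stand-ins for S_s, S_u, S_d


def to_state(delta, tol=0.0):
    """Map a change-rate difference to one of the three states."""
    if abs(delta) <= tol:
        return "same"
    return "up" if delta > 0 else "down"


def transition_probabilities(deltas, tol=0.0):
    """Estimate P(current state | previous state) for all nine state pairs
    by relative frequency over consecutive differences."""
    states = [to_state(d, tol) for d in deltas]
    pair_counts = Counter(zip(states, states[1:]))
    prev_counts = Counter(states[:-1])
    probs = {}
    for prev in STATES:
        for cur in STATES:
            total = prev_counts[prev]
            probs[(prev, cur)] = pair_counts[(prev, cur)] / total if total else 0.0
    return probs


# Toy differences: up, down, up, same, up, down
probs = transition_probabilities([0.2, -0.1, 0.3, 0.0, 0.4, -0.2])
```

In this toy sequence "up" is followed twice by "down" and once by "same", so the estimated probability of moving from "up" to "down" is 2/3.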

Step 3. Generate CEF. The obtained CE of transition probability is applied
in LL to further estimate the CEF. According to function (6), we consider the
explanatory variables from the change rate sequence (10) and the CE A in this
step. The CEF is estimated as function (14):

$$cr_t = \beta_0 + \sum_{i=1}^{n} \beta_i \, cr_{t-i} \, A, \tag{14}$$

where $\beta_0$ is an intercept and the $\beta_i$ represent the coefficients in the CEF. Here $i$ is
a certain interval between $cr_t$ and another observation $cr_{t-i}$. In order to guarantee
the best fit of the CEF, we also need to select significant explanatory variables
in terms of a predefined significance level. Usually, the significance level 0.01 is
adopted as the selection threshold.

3.2  Finding  a  Prediction  Function  in  Global  Level 

In the Global Level of the EOA model, the CEF is applied to build a prediction
function through another two steps, described as follows.

Step 1. Shift the CEF into the Prediction Function. The CEF f(t) accurately
explains the nonlinear change trend of the time series. According to function
(4), it is therefore used as a replacement for the constant $\mu$ in the prediction
function. All possible lag variables are regarded as independent variables in
the initial prediction function. In addition, the 0.01 significance level is adopted as
the selection threshold.

Step 2. Find the Prediction Function. The selection from Step 1 is performed
on all independent variables and is repeated until all trivial independent
variables are filtered out. The selected independent variables and the CEF f(t) are
eventually combined into the prediction function (15) as follows.


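[Equation (15): prediction function with two summations over the lag index i; its terms are not legible in the source.]    (15)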


where $\alpha_0$ is an intercept, $\alpha_{1i}$ and $\alpha_{2i}$ represent the coefficients in the prediction
function, and $i$ is a certain lag interval between the different observations.

Our stratified predictive model not only provides a solution to overcome the
linear assumption of ARIMA by using the nonlinear CE A, but also gives an established
function to predict short-term time series. Because the EOA
model adopts autoregression as the main body, we only need to estimate several
parameters in the prediction function in the Global Level. As a result, the risk
of overfitting is much lower than with Neural Networks. In addition, the training
and design time efficiency of the EOA model should be better than that of
Neural Networks as well. In the next section, we compare the prediction performance
of the EOA model with ARIMA and Neural Networks through a real-world
empirical study.

4  Empirical  Study 


In the following, we first discuss the selected time series and the experimental
setup. Then we provide the time series prediction results along with discussion.

4.1 Experiment Data and Setup

The selected time series are daily stock indexes from eight major stock exchange
markets. The eight markets are the stock exchange markets of the United States
of America, the United Kingdom (from 02/Jan/1986 to 31/Dec/2004), Canada,
Germany (from 02/Jan/1986 to 30/Dec/2004), Japan (from 04/Jan/1988 to
30/Dec/2004), Spain (from 30/Dec/1991 to 29/Dec/2004), Taiwan (from 24/Jan/1989
to 31/Dec/2004) and Singapore (from 08/Jan/1988 to 31/Dec/2004).

According to the reported successful applications, such as [3], [4], [5], [11],
[10], [13], [15], [17], we choose ARIMA and Neural Networks as two benchmarks
in this experiment.

For the ARIMA approach, we adopt the viewpoint of Man [9], who specified the
order P = 2 of the autoregressive model in addition to the order Q = 2 of the moving
average model, which can predict the best result. For the Neural Networks approach,
we cite the research of Nam and Schaefer [12], who obtained an accurate prediction
of international airline passengers by applying BPNN (Back-Propagation
Neural Network). We have made many comparisons between the different structures
of BPNN based on their successful experience. In the end we notice that
BPNN with three layers and twelve hidden nodes outperforms the other structures
for these time series. Therefore, we run three prediction models, BPNN (3, 12, 1),
ARIMA (2, 2) and ours, for each index time series in this experiment.
We also record the prediction values and the corresponding running time.

Predictive  accuracy  is  the  most  important  performance  criterion  in  this 
application [19],  so  we  report  two  frequently  used  predictive  accuracy  measures, 
the  Mean  Absolute  Percentage  Error  (MAPE)  and  the  Mean  Squared  Error 
(MSE)  in  our  comparison. 
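
For reference, the two measures can be written down in a few lines. This is a minimal sketch using the standard textbook definitions; the text does not state whether the errors are computed on index values or on change rates, so the inputs below are purely illustrative.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (standard definition)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error (standard definition)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only, not data from the experiment.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mape(actual, predicted), mse(actual, predicted))
```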

In our stratified model, the CE presents the correct change trend of the time
series if the accuracy of prediction is satisfactory. As a result, we need to train on the
time series until the residual of the prediction falls into a certain range. In this
experiment, we set -0.1 to 0.1 as the acceptable range. Figure 4 shows an example
of training on the USA index time series. In this example, the prediction result from our
stratified model fluctuates from day 1 to day 600. These fluctuations
shrink from day 600 and stay stable in the range -0.1 to 0.1 until day 725. Hence,
we select the first 725 days as the training set for the USA index.

Fig. 4. An Example of Training Time Series in USA index


4.2  Results 

Tables 1 and 2 record the one-day-ahead prediction results of BPNN,
ARIMA and EOA over ten consecutive trading days and twenty consecutive
trading days.

Table 3 records the running time of the three methods in the tests of 1 day, 10
days and 20 days.

In the 10 trading days test (Table 1), the EOA model outperforms BPNN and
ARIMA for five index time series (USA, UK, Taiwan, Germany and Canada index
time series). BPNN performs slightly better for the Spain, Singapore and Japan
index time series.

However, we observe from Table 3 that BPNN requires much more computing
time than our model to obtain an accurate prediction. The main reason is that
Neural Networks need more time to train on the time series. For example, it
consumes 20.63 seconds to predict 10 trading days in the Spain index time series


Table 1. Prediction Accuracy of Three Methods over 10 Trading Days

            This Work               BPNN                    ARIMA
            MAPE     MSE            MAPE     MSE            MAPE     MSE
USA         3.44%    1.76407E-06    4.70%    2.61984E-06    9.80%    9.71E-06
UK          4.12%    4.69458E-06    5.07%    8.22354E-06    10.55%   2.88690E-05
Taiwan      5.40%    0.001526522    5.01%    0.001593115    9.99%    0.004888072
Spain       7.18%    2.62739E-06    7.10%    2.50998E-06    10.16%   1.09405E-05
Singapore   3.33%    2.40009E-05    2.80%    1.91522E-05    7.75%    0.00013073
Japan       0.04%    2.09492E-05    0.54%    1.41529E-05    0.02%    2.21725E-05
Germany     2.12%    7.53072E-06    2.33%    9.99303E-06    3.15%    2.06237E-05
Canada      5.25%    3.97333E-06    6.39%    6.10492E-06    11.47%   0.000018844

Table 2. Prediction Accuracy of Three Methods over 20 Trading Days

            This Work               BPNN                    ARIMA
            MAPE     MSE            MAPE     MSE            MAPE     MSE
USA         4.34%    3.23740E-06    4.52%    3.23885E-06    8.10%    8.70E-06
UK          3.74%    4.7031E-06     4.44%    5.85332E-06    8.90%    2.23537E-05
Taiwan      3.70%    0.000852490    4.27%    0.001045931    0.97%    0.002903487
Spain       7.08%    4.54583E-06    8.32%    6.58587E-06    15.00%   2.28342E-05
Singapore   2.99%    2.0497E-05     2.47%    1.5324E-05     0.30%    0.000091764
Japan       0.63%    2.29941E-05    0.55%    1.60741E-05    0.98%    6.16616E-05
Germany     2.19%    9.0955E-06     2.19%    9.65026E-06    3.35%    2.36186E-05
Canada      4.17%    2.92397E-06    5.28%    4.74701E-06    9.72%    0.000014404

Table 3. Running Time (in seconds) of Three Methods

            This Work                  BPNN                       ARIMA
            1 day  10 days  20 days    1 day  10 days  20 days    1 day  10 days  20 days
USA         0.13   1.33     2.63       2.08   20.83    41.63      0.08   0.83     1.63
UK          0.14   1.43     2.83       2.12   21.23    42.43      0.12   1.23     2.43
Taiwan      0.11   1.13     2.23       2.13   21.33    42.63      0.11   1.13     2.23
Spain       0.16   1.63     3.23       2.06   20.63    41.23      0.05   0.53     1.03
Singapore   0.11   1.13     2.23       2.01   20.13    40.23      0.11   1.13     2.23
Japan       0.12   1.23     2.43       2.05   20.53    41.03      0.12   1.23     2.43
Germany     0.13   1.33     2.63       2.06   20.63    41.23      0.12   1.23     2.43
Canada      0.11   1.13     2.23       1.99   19.93    39.83      0.05   0.53     1.03

10 days running time = output time + 1 day running time × 10
20 days running time = output time + 1 day running time × 20
output time = 0.03 second


while our model spends only 1.63 seconds. In the 20 trading days test (Table 2), the
EOA model has the best accuracy for six time series (USA, UK, Taiwan, Spain,
Germany and Canada index time series). BPNN does better for the Singapore
and Japan index time series. Therefore, we might conclude from the experiment
that our model is competitive in both accuracy and time efficiency.
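
The running-time relation stated under Table 3 can be checked with a short sketch. The function name and default argument are illustrative; the 0.03 second output time and the 1-day figures are the ones reported in Table 3.

```python
OUTPUT_TIME = 0.03  # seconds, as stated under Table 3


def running_time(one_day_seconds, days, output_time=OUTPUT_TIME):
    """Running time for `days` one-day-ahead predictions: output time + days * 1-day time."""
    return output_time + one_day_seconds * days


# Spain index: 1-day times of 0.16 s (this work) and 2.06 s (BPNN) from Table 3.
print(round(running_time(0.16, 10), 2))
print(round(running_time(2.06, 10), 2))
```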


5  Conclusions 

In this paper, we have proposed the Element Oriented Analysis model to predict
short-term time series. The EOA model mitigates some technical drawbacks of
ARIMA and Neural Networks. The accuracy and time efficiency of the EOA
model relative to ARIMA and Neural Networks are demonstrated by an experiment
on stock indexes. Compared with these two mainstream models, the
experimental results suggest that the EOA-based stratified model is competitive
in accuracy and time efficiency. Our further work will extend the EOA model
to work on other real-world financial applications, such as credit scoring and
bankruptcy prediction.

Abstract. Modern product design and manufacturing processes are highly integrated
and exposed to frequent changes. This has made information reuse play
an increasingly important role in improving the efficiency of the product development
process. Mechanical Assembly Sequence Planning (MASP) is a key
issue in the manufacturing of a product. Known methods for MASP are not satisfactory
from the aspect of information reuse. This paper proposes an Answer
Set Programming (ASP) based solution to MASP, where information reuse is
enhanced by dividing an ASP program into EDB (extensional database) and
IDB (intensional database) such that the IDB can be shared by all the assemblies
with the same number of parts. Compared with other approaches for MASP,
this is a great advantage. Experiments are conducted to show the applicability
and performance of our method by using different answer set solvers.

Keywords: Mechanical Assembly Sequence Planning, Answer Set Programming,
EDB, IDB


1  Introduction 

The highly competitive nature of the global market for manufactured products has made
product design and manufacturing strategies integrated, computerized and always
exposed to frequent changes. In this environment, information reuse in these two
processes plays an increasingly important role in improving the efficiency of the
product development process. Mechanical Assembly Sequence Planning (MASP) is
the task of finding the feasible or optimal sequence that puts the initially separated
parts of an assembly together to form the assembled product. A MASP algorithm
takes as input the CAD model of an assembly produced by the product design process
and produces feasible assembly sequences for the assembly. Much effort has been
devoted to this research and many methodologies have been proposed. In the literature,
there exist a large number of algorithms or systems for assembly sequence generation.
These systems differ both in the representation of assembly sequences and in the
reasoning technique used to identify feasible sequences. Classic methods include
interactive systems, which work by asking the user questions to obtain the information
necessary to construct feasible assembly sequences [1,17], and cut-set methods, which use a
cut-set algorithm to find feasible assembly sequences [7]. The planning algorithm based
on OBDDs is a variation of the classic methods [6], and can be viewed as an attempt
to attack the combinatorial state explosion problem in storing all feasible assembly
sequences. Another line of research is to use soft computing techniques, such as the
simulated annealing algorithm and the genetic algorithm, to generate assembly sequences [9,
12]. There also exists an effort to build expert knowledge based systems to generate
feasible assembly sequences [16,18].

The above-mentioned assembly sequence generation systems and algorithms are
not satisfactory from the aspect of information reuse. For example, in the OBDD
(Ordered Binary Decision Diagram) based method proposed in [6], once the liaison
graph or the interference relation changes, all OBDDs describing the contact and
interference information of the original assembly have to be rewritten for the new
assembly. This process is very time-consuming and usually requires expert knowledge.
In contrast, there is significant information reuse in expert knowledge based systems,
but the reused knowledge is related to the special structures of assemblies [18]. This
paper is, however, mainly concerned with the reuse of geometry-based knowledge of
assemblies.

MASP is a special kind of planning problem. In the last decade, an important
method for solving planning problems has been to make use of declarative programming
languages, such as Answer Set Programming (ASP), to make the solution
declarative [11]. For example, an ASP based method allows us to divide the planning
process into two stages: describing the problem in ASP, and using general purpose
answer set solvers, such as DLV, smodels and cmodels, to find solutions [4,10,13]. The
main advantage of a declarative method is that it allows professionals to be concerned
mainly with "what" a solution must satisfy and not with the details of "how" to find a
solution to the problem. From our point of view, this separation provides a chance for
information reuse. Specifically, what a solution must satisfy can be divided into two
parts: one part is case-sensitive, and the other is applicable to a class of problems. The
latter part is the information that can be reused. If we store this information in a
logic program, this division corresponds naturally to the concepts of EDB (extensional
database), which represents a collection of facts, and IDB (intensional database),
which represents the reasoning components [14].

ASP is a kind of logic programming language under the answer set semantics. The
expressiveness, declarative nature and existence of efficient answer set solvers have
made ASP a mainstream tool for knowledge representation and reasoning [2]. This
paper proposes an ASP based method for MASP, where all assembly knowledge is
represented with ASP rules. The case-sensitive information, such as the contact and
interference relations, is included in the EDB, and general information that is applicable to a
class of problem cases is included in the IDB. When the case-sensitive information
changes, we only change the EDB of the knowledge, and the IDB is left unchanged. It is
shown that information reuse is greatly improved in this method, and acceptable
performance is also achieved.

2 Preliminaries

We first briefly introduce ASP [3]. The answer set semantics of logic programs treats
a rule with variables as shorthand for the set of its ground instances. So in defining the
answer set semantics we assume that all the rules in a program do not contain variables.
We follow the terminology style of DLV and write classic negation as ~ [10].
Let A be an atom. A literal takes the form A or ~A, where A is a positive literal and ~A
is a negative literal; A and ~A are called complementary literals.

An extended disjunctive logic program P is a set of rules, and each rule r is of the
form:

L_1 v ... v L_k :- L_{k+1}, ..., L_m, not L_{m+1}, ..., not L_n,

where n ≥ m ≥ k ≥ 0, each L_i is a literal, and not is the negation as failure (NAF). We
define head(r) = {L_1, ..., L_k} as the head of r, and pos(r) = {L_{k+1}, ..., L_m} and
neg(r) = {L_{m+1}, ..., L_n} as the positive and negative literals present in the body of r,
respectively. In particular, a rule r without a head is called a constraint.

Next is the definition of the answer sets for extended logic programs without NAF.
Let $\Pi$ be an extended logic program without not, and let Lit be the set of ground literals in
the language of $\Pi$. An answer set for $\Pi$ is any minimal subset S of Lit such that

a) for each rule $r \in \Pi$, if $pos(r) \subseteq S$, then there exists some $l \in head(r)$ such that $l \in S$;

b) if S contains complementary literals, then S = Lit.

This definition can be extended to programs with NAF as follows. Let $\Pi$ be an
extended logic program, and let Lit be the set of ground literals in the language of $\Pi$. For any
set $S \subseteq Lit$, let $\Pi^S$ be the program obtained from $\Pi$ as follows:

$\Pi^S = \{ r' \mid r \in \Pi,\ neg(r) \cap S = \emptyset,\ head(r') = head(r),\ pos(r') = pos(r),\ neg(r') = \emptyset \}.$

Clearly $\Pi^S$ does not contain not, so its answer sets are already defined. If S is one of
them, then S is an answer set for $\Pi$.
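
The definition above can be exercised on a tiny ground program with a brute-force sketch. This is not how solvers such as DLV work; it simply enumerates every candidate set S, builds the reduct, and tests conditions (a) and (b) together with minimality. The encoding of a rule as a `(head, pos, neg)` tuple of strings and the `~` prefix for classical negation are assumptions made for illustration.

```python
from itertools import combinations


def complement(lit):
    """Complement under classical negation: a <-> ~a."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def reduct(program, s):
    """Drop rules whose NAF literals intersect s; strip the NAF part from the rest."""
    return [(head, pos) for head, pos, neg in program if not (set(neg) & s)]


def is_closed(rules, s, lit):
    """Conditions (a) and (b) of the definition for NAF-free programs."""
    for head, pos in rules:
        if set(pos) <= s and not (set(head) & s):
            return False
    inconsistent = any(complement(l) in s for l in s)
    return (not inconsistent) or s == lit


def answer_sets(program):
    """Enumerate answer sets by brute force (exponential; tiny programs only)."""
    lit = set()
    for head, pos, neg in program:
        for l in (*head, *pos, *neg):
            lit.update({l, complement(l)})
    lit = frozenset(lit)
    result = []
    for s in subsets(lit):
        rules = reduct(program, s)
        if is_closed(rules, s, lit) and not any(
            t < s and is_closed(rules, t, lit) for t in subsets(s)
        ):
            result.append(s)
    return result


program = [
    (("a", "b"), (), ()),      # a v b.
    (("c",), ("a",), ("b",)),  # c :- a, not b.
]
print(sorted(sorted(s) for s in answer_sets(program)))
```

A constraint is encoded with an empty head, so any candidate set that satisfies its body fails condition (a) and is rejected.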

Assembly Sequence Planning [5,8]

A mechanical assembly is a composition of interconnected parts forming a stable unit.
Each part is a solid rigid object, that is, its shape remains unchanged. Parts are
interconnected whenever they have one or more compatible surfaces in contact. Surface
contacts between parts reduce the degree of freedom for relative motion.

A subassembly is a nonempty subset of parts that either has one element (i.e., only
one part) or is such that every part has at least one surface contact with another part in
the subset. Although there are cases where it is possible to join a pair of parts in more
than one way, unique assembly geometry will be assumed for each pair of parts.

It is assumed that whenever a subassembly is formed, all connections between its
parts are established. Therefore a subassembly can be characterized by its set of parts.
Given two subassemblies characterized by their sets of parts $S_1$ and $S_2$, joining $S_1$ and
$S_2$ is an assembly task if $S = S_1 \cup S_2$ is a subassembly. An assembly task is said to be
geometrically feasible if there is a collision-free path to bring the two subassemblies
into contact from a situation in which they are far apart. For the purpose of verifying
the geometric feasibility of an assembly task, Gottipolu and Ghosh introduced a
translation function T from the viewpoint of disassembling [5]. In this paper we will
redefine the function from the viewpoint of assembling. Specifically, T is defined as

$T: P \times D \times P \rightarrow \{0, 1\},$

where P is the set of parts of an assembly, and $D = \{1, 2, 3, 4, 5, 6\}$ denotes the six
directions (1, 2, 3 for X+, Y+, Z+, and 4, 5, 6 for X-, Y- and Z-, respectively). $T(a, d, b) = 1$
if and only if part b has the freedom of translational motion w.r.t. part a in direction d
from far away.

Example: The value of the translation function for the assembly in Fig. 1 is 1 on the
following set of triples:

{(a,3,b) (a,4,b) (a,6,b) (a,1,c) (a,3,c) (a,4,c) (a,5,c) (a,6,c) (a,5,d) (b,1,a) (b,3,a) (b,6,a) (b,1,c) (b,3,c) (b,4,c) (b,5,c)
(b,6,c) (b,5,d) (c,1,a) (c,2,a) (c,3,a) (c,4,a) (c,6,a) (c,1,b) (c,2,b) (c,3,b) (c,4,b) (c,6,b) (c,5,d) (d,2,a) (d,2,b) (d,2,c)}.

The assembly process consists of a succession of assembly tasks, each of which consists
of joining subassemblies to form a larger subassembly. The process starts with
all parts separated and ends with all parts properly joined to form the whole assembly.
It is assumed that exactly two subassemblies are joined at each assembly task, and
that after parts have been put together, they remain together until the end of the
assembly process.

It is also assumed that whenever two parts are joined all contacts between them are
established. Due to this assumption an assembly can be represented as an undirected
graph <P, C>, where P is the set of nodes and C is the set of edges. Each node in P
corresponds to a part in the assembly and there is an edge in C connecting every pair
of nodes whose corresponding parts have at least one surface contact. The elements in
C are referred to as connections, and <P, C> is referred to as the assembly's connection
graph. A connection encompasses all contacts between parts.

Example: Fig. 1 gives an assembly in its exploded view (a) and assembled view (b).
The connection graph for the assembly is shown in Fig. 1(c).

Fig. 1. An example assembly in its (a) exploded view and (b) assembled view, and its
corresponding (c) connection graph

The state of the assembly process is the configuration of the parts at the beginning
or at the end of an assembly task. The configuration is given by the contacts that have
been established. Since whenever two parts are joined all contacts are established, the
configuration is given by the connections that have been established.

For an assembly with n parts, an assembly sequence is an ordered set of n-1 tasks.
Alternatively, assembly sequences can be represented by an ordered sequence of
states [8]. In this paper we will use an ordered list of binary vectors to denote an
assembly sequence, where each vector corresponds to a state and the number of the list
elements is equal to the number of parts.

Example: Since there are five connections for the assembly in Fig. 1, we can use a
vector $(b_1, \ldots, b_5)$ to denote an assembly state where connection $e_i$ is established only if
$b_i = 1$. Then an example assembly sequence is:

(00000) → (10000) → (11000) → (11111)    (*)

which corresponds to an assembly sequence where connection $e_1$ is established first,
$e_2$ is established second, and $e_3$, $e_4$, $e_5$ are established third.

A linear assembly sequence is one in which each task involves the insertion of a
single part into the other subassembly [15]. An assembly for which a linear assembly
sequence exists is called a linear assembly. This paper is mainly concerned with linear
assemblies. By definition each linear assembly sequence may correspond to one or
more orders of the assembly parts. For example, the assembly sequence above is
obviously linear, from which we could get two assembly orders:

{abcd, bacd}.

Next we show that from any element of the set of orders, the sequence (*)
can be reproduced. We first define the state corresponding to a non-empty prefix
of an assembly order. Let $(b_1, \ldots, b_n)$ denote a state where there are n connections
in total, and $b_i$ indicates whether the connection $e_i$ is established. The map f is
defined as:

$f(a_1 \ldots a_k) = (b_1, \ldots, b_n), \quad (k \le n)$

where $b_i = 1$ if and only if connection $e_i$ is established in the subassembly $\{a_1, \ldots, a_k\}$.

Take the assembly order abcd as an example. By definition we have:

f(a) = (00000), f(ab) = (10000), f(abc) = (11000), f(abcd) = (11111).

The above states exactly correspond to the sequence (*). So in linear assembly, the
space of assembly sequences can always be reproduced from the set of possible
assembly orders. Hence we can safely use the assembly order of parts to represent the
corresponding assembly sequence.
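
The map f can be sketched in a few lines for the example assembly. The numbering of the connections is partly an assumption: the sequence (*) forces e1 = a-b and e2 = a-c, while the order chosen here for e3 to e5 (a-d, b-d, c-d) is illustrative, and the function names are hypothetical.

```python
# Connections e1..e5 of the example assembly; the order of e3..e5 is assumed.
CONNECTIONS = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "d"), ("c", "d")]


def state(prefix, connections=CONNECTIONS):
    """f(a_1 ... a_k): b_i = 1 iff both ends of connection e_i are already assembled."""
    placed = set(prefix)
    return tuple(int(u in placed and v in placed) for u, v in connections)


def sequence(order):
    """List of states reached after each non-empty prefix of an assembly order."""
    return [state(order[:k]) for k in range(1, len(order) + 1)]


for order in ("abcd", "bacd"):
    print(order, ["".join(map(str, s)) for s in sequence(order)])
```

Both orders give the same list of states, which is why the order of parts can stand in for the state sequence in the linear case.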

An assembly sequence is geometrically feasible if all its assembly tasks are
geometrically feasible. To verify the geometric feasibility of an assembly sequence,
two types of constraints, connectivity constraints and precedence constraints,
must be considered [5]. The connectivity constraints specify which parts are
connected to other parts in terms of an assembly operation. Specifically, we require
that:

Any part assembled at a step should have some contact
with at least one previously assembled part.

The precedence constraints represent the fact that some assembly tasks have to be
implemented before the others; otherwise, they will interfere with later assembly
operations. Specifically, we require that:

For any part assembled at a step, there exists a collision-free path to bring the part
into contact with the subassembly that consists of previously assembled parts.

Example: The assembly sequence "acdb" is not feasible because there exists no
direction in which part b can be brought into proper contact with the subassembly that
consists of a, c and d.
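
The two constraints can be checked directly on the example assembly. The sketch below is one plausible reading, not the encoding developed in the paper: it assumes that a collision-free path means a single translation direction d with T(q, d, part) = 1 for every previously assembled part q. The table `FREE` restates the example values of T, and all names are illustrative.

```python
# Surface contacts of the example assembly (edges of the connection graph).
CONTACTS = {frozenset(pair) for pair in ("ab", "ac", "ad", "bd", "cd")}

# FREE[(q, x)] = directions d with T(q, d, x) = 1, taken from the example values of T.
FREE = {
    ("a", "b"): {3, 4, 6}, ("a", "c"): {1, 3, 4, 5, 6}, ("a", "d"): {5},
    ("b", "a"): {1, 3, 6}, ("b", "c"): {1, 3, 4, 5, 6}, ("b", "d"): {5},
    ("c", "a"): {1, 2, 3, 4, 6}, ("c", "b"): {1, 2, 3, 4, 6}, ("c", "d"): {5},
    ("d", "a"): {2}, ("d", "b"): {2}, ("d", "c"): {2},
}


def is_feasible(order):
    """Check connectivity and (single-direction) precedence for an assembly order."""
    placed = []
    for part in order:
        if placed:
            # Connectivity: contact with at least one previously assembled part.
            connected = any(frozenset((part, q)) in CONTACTS for q in placed)
            # Precedence: some direction is free with respect to every placed part.
            directions = set(range(1, 7))
            for q in placed:
                directions &= FREE[(q, part)]
            if not connected or not directions:
                return False
        placed.append(part)
    return True


print(is_feasible("abcd"), is_feasible("bacd"), is_feasible("acdb"))
```

For the order acdb, the directions free for b are {3, 4, 6} with respect to a but only {2} with respect to d, so no common direction remains.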

3  Formulation  of  Assembly  Knowledge 

This section discusses the description of assembly knowledge in the DLV
language.

3.1  Representation  of  Assembly  Sequences 

Following the discussion in Section 2, we will use assembly orders or permutations of
assembly parts to represent assembly sequences. Given an m-part assembly, i.e. an
assembly with m parts, the parts are encoded by 1, ..., m. Assembly steps are also
denoted by the set of integers {1, ..., m}. An assembly sequence is denoted by a set of
m pairs <i, n>, where i is an assembly step and n denotes a part. By <i, n> we mean
that part n is assembled at step i. So for any <i, n> and <j, k>, n ≠ k if i ≠ j.

If we use a unary predicate p(X) to denote that X is a part of the assembly under
consideration, the knowledge of parts can be represented by the following set of facts.

{p(1). p(2). ... p(m).}

In order to represent an assembly sequence in ASP, a binary predicate s(I, X) is
introduced to denote that part X is assembled at step I. Then an assembly sequence for an
m-part assembly can be represented by a set of m atoms {s(1, x_1), ..., s(m, x_m)}, where
x_i ∈ {1, ..., m} and i ∈ {1, ..., m}.

Example: Given the assembly in Fig. 1, if we use the integers 1, 2, 3 and 4 to encode parts
a, b, c and d, respectively, an assembly sequence can be described as follows:

{s(1,1), s(2,2), s(3,3), s(4,4)}.

In this assembly sequence, the order for the parts to be assembled is 1-2-3-4.

For an m-part assembly, the constraint that every part of the assembly must be
assembled can be described as:

For each part k ∈ {1, ..., m}, there exists some step J such that s(J, k) holds.

This constraint can be represented as m rules, with each rule corresponding to one
part:

s(1, 1) v ... v s(m, 1). ... s(1, m) v ... v s(m, m).

The constraint that any part can be assembled only once can be described as:

For each part X and two steps I and J,
if I ≠ J then s(I, X) and s(J, X) cannot be true simultaneously.

In DLV language, the constraint can be described as follows:

:- s(I, X), s(J, X), I != J.

When the constraint is instantiated, we will get m instances.
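
Because these rules depend only on the number of parts m, they can be produced mechanically. The helper below is a hypothetical sketch that emits the part facts, the m disjunctive rules and the uniqueness constraint as text in DLV-style syntax; the function name is not from the paper, and whether the output is accepted unchanged by a particular DLV version is not verified here.

```python
def sequence_rules(m):
    """Return DLV-style rules for an m-part assembly as a list of strings."""
    lines = [f"p({k})." for k in range(1, m + 1)]
    # One disjunctive rule per part k: it is assembled at some step 1..m.
    for k in range(1, m + 1):
        lines.append(" v ".join(f"s({i},{k})" for i in range(1, m + 1)) + ".")
    # A part cannot be assembled at two different steps.
    lines.append(":- s(I,X), s(J,X), I != J.")
    return lines


print("\n".join(sequence_rules(4)))
```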

3.2  Representation  of  the  Assembly’s  Connection  Graph  and  Translation 
Function 

ASP can represent an assembly's connection graph in a straightforward manner. Let the
binary predicate a(X,Y) denote that part X and part Y have a surface contact. The
assembly's connection graph in Fig. 1(c) can be translated into the ASP facts:

{a(1,2). a(1,3). a(1,4). a(2,4). a(3,4).}

and the following two rules:

{a(X,Y) :- a(Y,X). ~a(X,Y) :- not a(X,Y), p(X), p(Y).}

where 1, 2, 3 and 4 are the encoding of part a, b, c and d, respectively.

The first rule says that the edges in a connection graph are undirected, and the second
rule claims that the surface contact information is complete and hence can be
used with the closed world assumption, i.e. any contact information which cannot be
derived from the rules and facts above is false.
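
The effect of the two rules on the example facts can be mimicked with plain sets. This is only an illustration of what the rules derive, with hypothetical function names; it does not call an ASP solver.

```python
PARTS = [1, 2, 3, 4]
FACTS = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}  # the a(X,Y) facts of the example


def contact_relation(facts):
    """Rule a(X,Y) :- a(Y,X): close the contact facts under symmetry."""
    return facts | {(y, x) for x, y in facts}


def no_contact_relation(parts, contact):
    """Rule ~a(X,Y) :- not a(X,Y), p(X), p(Y): closed world assumption over all part pairs."""
    return {(x, y) for x in parts for y in parts if (x, y) not in contact}


a = contact_relation(FACTS)
not_a = no_contact_relation(PARTS, a)
print(sorted(not_a))
```

Note that the second rule also derives ~a(X,X) for every part X, since a(X,X) is never derivable.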

The translation function can be represented as ASP rules in a very natural manner. To
do this we introduce a ternary predicate pre(X, D, Y), which means that part X
does not have the freedom of translational motion w.r.t. part Y in direction D from far
away, i.e. part Y prevents the motion of X in direction D. Therefore the predicate pre()
can be viewed as the interference relation between a pair of parts. By the definitions
of pre() and the translation function T, the following proposition holds.

Proposition 3.1. Atom pre(a, l, b) is true if and only if T(b, l, a)=0.

By  the  above  proposition,  the  translation  function  of  the  assembly  in  Fig.  1(a)  can 
be  translated  into  the  following  set  of  facts: 

{ pre(2,1,1). pre(2,2,1). pre(2,5,1). pre(3,2,1). pre(4,1,1). pre(4,2,1). pre(4,3,1). pre(4,4,1). pre(4,6,1). pre(1,2,2).
pre(1,4,2). pre(1,5,2). pre(3,2,2). pre(4,1,2). pre(4,2,2). pre(4,3,2). pre(4,4,2). pre(4,6,2). pre(1,5,3). pre(2,5,3).
pre(4,1,3). pre(4,2,3). pre(4,3,3). pre(4,4,3). pre(4,6,3). pre(1,1,4). pre(1,3,4). pre(1,4,4). pre(1,5,4). pre(1,6,4).
pre(2,1,4). pre(2,3,4). pre(2,4,4). pre(2,5,4). pre(2,6,4). pre(3,1,4). pre(3,3,4). pre(3,4,4). pre(3,5,4). pre(3,6,4). }


3.3  Representation  of  Connectivity  Constraints  and  Precedence  Constraints 

Straightforward  Representation  for  Connectivity  Constraints  (SR-CC) 

Given an assembly and a connection graph, the connectivity constraints require that
the last assembled part must have surface contact with at least one part that has already
been assembled. If we use the predicate a(X, Y), these constraints can be described in a
very natural way. All the constraints are of the following form:

:- s(n, X_n), s(n-1, X_{n-1}), ..., s(1, X_1), ~a(X_n, X_{n-1}), ..., ~a(X_n, X_1).   (A.1)

The above constraint says that if the most recently assembled part is X_n, then X_n must
have surface contact with at least one of the parts assembled in the previous steps. If
an assembly has m parts, for each n∈{2, ..., m} there will be a constraint of the
form (A.1); each constraint will have $m^n$ ground instances. So in total there will
be $\sum_{n=2}^{m} m^n$ instances. Therefore this representation of the connectivity
constraints has an exponential space complexity. This motivates us to find a
more efficient representation, where each constraint contains a smaller number of
variables.
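To get a feel for this growth, the sum can be evaluated directly. The snippet below is a small sketch that only evaluates the count stated above; the function name is hypothetical.

```python
def sr_cc_ground_instances(m):
    """Number of ground instances of (A.1) for an m-part assembly: sum of m^n for n = 2..m."""
    return sum(m ** n for n in range(2, m + 1))


# Counts for a few assembly sizes, as (m, count) pairs.
growth = [(m, sr_cc_ground_instances(m)) for m in (4, 6, 8)]
```

For the four-part example this gives 16 + 64 + 256 = 336 instances, while an eight-part assembly already needs more than 19 million.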

Improved Representation for Connectivity Constraints (IR-CC)

Recall the image of a set of elements in a graph. Given an assembly’s connection
graph <P, C> and a set G⊆P of nodes, the image of G in the graph is defined as:

Image(G) = { e_2 | <e_1, e_2> ∈ C ∧ e_1 ∈ G }.

If a part i is assembled in the first step and part j is assembled in the second step,
i.e. s(1, i) and s(2, j) hold, then j must be an element of Image({i}) in the connection
graph. Generally, if s(n, j) holds, j must belong to the image of the set of parts
assembled from step 1 to step (n-1). In order to represent this knowledge, we introduce
a set of constants {t_1, ..., t_n}, where t_i denotes the image of the set of parts
assembled from step 1 to step i. This notation immediately leads to the following
relation:

t_i ⊆ t_j, where j > i > 0.   (A.2)

Formally, t_i can be defined recursively as follows:

Definition 3.2. Let X and Y be two parts of an assembly.

1) if s(i, Y) and a(X, Y), then X ∈ t_i;
2) if X ∈ t_{i-1}, then X ∈ t_i;
3) the elements of t_i are generated only by 1) and 2).
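The recursive definition can be sketched in Python as follows, assuming the symmetric contact relation of the four-part example. The function names image and image_sets are hypothetical and serve only to illustrate Definition 3.2.

```python
# Symmetric contact relation of the four-part example (assumed from the connection graph).
BASE = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}
CONTACTS = BASE | {(y, x) for (x, y) in BASE}


def image(contacts, group):
    """Image(G): all parts that touch some part in G."""
    return {y for (x, y) in contacts if x in group}


def image_sets(contacts, sequence):
    """Return [t_1, ..., t_k] for an assembly sequence, following Definition 3.2."""
    t, result = set(), []
    for part in sequence:
        # Clause 1 adds the neighbours of the newly assembled part; clause 2 keeps t_{i-1}.
        t = t | image(contacts, {part})
        result.append(t)
    return result


t_sets = image_sets(CONTACTS, [1, 2, 3, 4])
```

A part assembled at step i must then be a member of t_{i-1}. For instance, the sequence starting with part 2 followed by part 3 is ruled out because 3 is not in the image of {2}.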


If we use bel(X, t_i) to denote that X is an element of t_i, then bel(X, t_i) can be defined
as follows.

For i=1, we have:

bel(X, t_1) :- s(1, Y), a(X, Y).   (A.3)
~bel(X, t_1) :- not bel(X, t_1), p(X).   (A.4)

For m>i≥2, we have:

bel(X, t_i) :- s(i, Y), a(X, Y).   (A.5)
bel(X, t_i) :- bel(X, t_{i-1}).   (A.6)
~bel(X, t_i) :- not bel(X, t_i), p(X).   (A.7)


For an m-part assembly, there will be $m^2$ ground instances of constraint (A.3), m
ground instances of constraint (A.4), $m^2(m-1)$ ground instances of constraint (A.5),
and m(m-1) ground instances of constraint (A.6) or (A.7). So there are a total of
($m^3+3m^2-2m$) ground constraint instances. With this definition the connectivity
constraints can be described as follows:

if s(i, X) holds, then bel(X, t_{i-1}) must hold; or equivalently,
s(i, X) and ~bel(X, t_{i-1}) cannot be true simultaneously.

In answer set programming, the constraints can be described as:

:- s(i, X), ~bel(X, t_{i-1}).   (m≥i≥2)   (A.8)

Given an m-part assembly, there will be (m-1) constraints of form (A.8), and each
constraint will have m instances. So in total there are m(m-1) instances. Then in order
to define connectivity constraints, we need [$m(m-1)+m^3+3m^2-2m$]=($m^3+3m^2-3m$)
ground constraint instances, which is a great improvement compared with the
straightforward method for representing the same constraints.

Straightforward Representation for Precedence Constraints (SR-PC)

The precedence constraints require that any part considered for assembling cannot be
prevented in all six directions by the previously assembled parts. This constraint has
a clear relationship with the translation function and therefore with the predicate pre().

Now we give the most natural representation of this kind of constraint. For an m-part
assembly, we first introduce a set of constants {a_1, ..., a_m}, where a_i denotes the
set of parts assembled before step i. Clearly there are no parts belonging to a_1. The
membership of a part in a_i (m≥i≥2) can be defined easily as follows.

For i=2, we have:

bel(X, a_2) :- s(1, X).   (B.1)

For m≥i>2, we have:

bel(X, a_i) :- s(i-1, X).   (B.2)
bel(X, a_i) :- bel(X, a_{i-1}).   (B.3)

Suppose that we have assembled (i-1) parts of an assembly; then the constraint
for assembling the i-th (m≥i≥2) part can be described as follows:

:- s(i, X), pre(X, 1, X_1), pre(X, 2, X_2), pre(X, 3, X_3), pre(X, 4, X_4),
   pre(X, 5, X_5), pre(X, 6, X_6), bel(X_1, a_i), bel(X_2, a_i), bel(X_3, a_i),
   bel(X_4, a_i), bel(X_5, a_i), bel(X_6, a_i).   (B.4)

Constraint (B.4) says that the following two statements cannot be true simultaneously:
(a) part X is assembled at step i, and (b) X cannot be assembled at step i, i.e. in all six
directions the motion of X is prevented by some previously assembled part.

For an m-part assembly there will be m instances of (B.1), m(m-2) instances of
(B.2) or (B.3), and $m^7(m-1)$ instances of (B.4). A huge number!

Improved  Representation  for  Precedence  Constraints  (IR-PC) 

The space complexity of SR-PC motivates us to find a more efficient representation
method, where there are fewer variables present in each constraint. In doing
so we introduce the predicate prevent(a_i, d, X), denoting that there is a part in
a_i which prevents part X in direction d. Obviously prevent(a_i, D, X) can be defined as
follows:

prevent(a_i, D, X) :- pre(X, D, Y), bel(Y, a_i).   (m≥i≥2, D∈{1, ..., 6})   (B.5)

With this predicate the constraint (B.4) can be rewritten as:

:- s(i, X), prevent(a_i, 1, X), prevent(a_i, 2, X), prevent(a_i, 3, X),
   prevent(a_i, 4, X), prevent(a_i, 5, X), prevent(a_i, 6, X).   (B.6)

The number of ground instances of constraint (B.5) is $6m^2(m-1)$, and the number of
instances of constraint (B.6) is m(m-1). So in total we have
$m+2m(m-2)+6m^2(m-1)+m(m-1)=6m^3-3m^2-4m$ ground instances for constraints (B.1), (B.2),
(B.3), (B.5) and (B.6). This is a great improvement over the straightforward formulation,
where there are ($m^8-m^7+2m^2-3m$) constraint instances.
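The two totals can be checked by summing the per-constraint counts given in the text. The following sketch does exactly that; the function names are hypothetical, and the count m(m-2) is taken to apply to each of (B.2) and (B.3), as the stated totals imply.

```python
def sr_pc_ground_instances(m):
    """Ground instances for (B.1)-(B.4) in the straightforward representation."""
    b1 = m                       # (B.1)
    b2_b3 = 2 * m * (m - 2)      # (B.2) and (B.3), m(m-2) each
    b4 = m ** 7 * (m - 1)        # (B.4): seven part variables, m-1 steps
    return b1 + b2_b3 + b4


def ir_pc_ground_instances(m):
    """Ground instances for (B.1)-(B.3), (B.5), (B.6) in the improved representation."""
    b1 = m
    b2_b3 = 2 * m * (m - 2)
    b5 = 6 * m ** 2 * (m - 1)    # (B.5): six directions, two part variables, m-1 steps
    b6 = m * (m - 1)             # (B.6)
    return b1 + b2_b3 + b5 + b6


# (m, straightforward count, improved count) for a few assembly sizes.
comparison = [(m, sr_pc_ground_instances(m), ir_pc_ground_instances(m)) for m in (4, 8, 16)]
```

For the four-part example the straightforward representation needs 49172 ground instances against 320 for the improved one.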

3.4  ASP  Programs  for  Assembly  Sequences  Generation 

Given an assembly and its corresponding connection graph and translation function,
the rules and constraints discussed in the above sections can be created automatically.
Those rules and constraints constitute an ASP program for generating all
feasible assembly sequences for an assembly. Each program is divided into an EDB
consisting of an assembly's contact and interference information and an IDB consisting
of the rules and constraints a feasible assembly sequence must satisfy. Note that the rules
and constraints in an IDB are general in that they are shared by all assemblies with
the same number of parts; the EDB, however, is the component that is exposed to
frequent changes in the product design process. In the following sections, an ASP
program in the language of DLV/smodels/cmodels is called a DLV/smodels/cmodels
program.
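Because the IDB depends only on the number of parts m, it can be produced mechanically. The sketch below is a hypothetical Python generator (not the authors' tool) that emits the improved-method rules in DLV syntax from the schemas (A.3)-(A.8), (B.1)-(B.3), (B.5) and (B.6), writing `~` for strong negation as in the text.

```python
def generate_idb(m):
    """Return the IM IDB for an m-part assembly as a list of DLV rules (illustrative only)."""
    parts = range(1, m + 1)
    rules = [f"p({x})." for x in parts]
    rules.append("a(X,Y) :- a(Y,X).")
    # One disjunctive rule per step: some part is assembled at that step.
    for i in parts:
        rules.append(" v ".join(f"s({i},{x})" for x in parts) + ".")
    # Any part can be assembled only once.
    rules.append(":- s(I,X), s(J,X), I!=J.")
    # Connectivity constraints, schemas (A.3)-(A.8).
    rules.append("bel(X,t1) :- s(1,Y), a(X,Y).")
    rules.append("~bel(X,t1) :- not bel(X,t1), p(X).")
    for i in range(2, m):
        rules.append(f"bel(X,t{i}) :- s({i},Y), a(X,Y).")
        rules.append(f"bel(X,t{i}) :- bel(X,t{i - 1}).")
        rules.append(f"~bel(X,t{i}) :- not bel(X,t{i}), p(X).")
    for i in range(2, m + 1):
        rules.append(f":- s({i},X), ~bel(X,t{i - 1}).")
    # Precedence constraints, schemas (B.1)-(B.3), (B.5), (B.6).
    rules.append("bel(X,a2) :- s(1,X).")
    for i in range(3, m + 1):
        rules.append(f"bel(X,a{i}) :- s({i - 1},X).")
        rules.append(f"bel(X,a{i}) :- bel(X,a{i - 1}).")
    for i in range(2, m + 1):
        rules.append(f"prevent(a{i},D,X) :- pre(X,D,Y), bel(Y,a{i}).")
        body = ", ".join(f"prevent(a{i},{d},X)" for d in range(1, 7))
        rules.append(f":- s({i},X), {body}.")
    return rules


idb_for_four_parts = generate_idb(4)
```

Joining the returned list with newlines and appending the EDB facts of a concrete assembly would give a complete program; only the EDB has to change when the design changes.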

Section 3.3 has presented straightforward and improved representations for connectivity
constraints and precedence constraints. Each combination of the representation
methods for the two types of constraints leads to an ASP based method for MASP. So
there are four MASP methods that use: 1) SR-CC and SR-PC, 2) SR-CC and IR-PC,
3) IR-CC and SR-PC, and 4) IR-CC and IR-PC. Here we are mainly concerned
with the first and fourth methods. In what follows the two methods will be denoted by
Straightforward Method (SM) and Improved Method (IM), respectively. The space
complexity of knowledge representation in SM and IM is presented in Table 1.

Table 1. Space complexity of the knowledge representation in SM and IM, where m is the
number of parts of the assembly

MASP     Representation   Needed constraints          Total number of          Space
method   method                                       ground instances         complexity
SM       SR-CC            (A.1)                       $\sum_{n=2}^{m} m^n$     $O(m^m)$
         SR-PC            (B.1)-(B.4)                 $m^8-m^7+2m^2-3m$
IM       IR-CC            (A.3)-(A.8)                 $m^3+3m^2-2m$            $O(m^3)$
         IR-PC            (B.1)-(B.3), (B.5), (B.6)   $6m^3-3m^2-4m$

From Table 1, it can be seen that a DLV program that adopts SM has an exponential
space complexity, while a DLV program that adopts IM has polynomial space
complexity $O(m^3)$. In Section 4 we will further investigate the performance of the two
methods on a collection of assemblies.

Taking the four-part assembly in Fig. 1 as an example, the DLV program that uses IM
for constraint representation is shown in Fig. 2.


% EDB consisting of the assembly's connection and interference information.
a(1,2). a(1,3). a(1,4). a(2,4). a(3,4).

% the assembly's interference relation.
pre(2,1,1). pre(2,2,1). pre(2,5,1). pre(3,2,1). pre(4,1,1). pre(4,2,1). pre(4,3,1). pre(4,4,1). pre(4,6,1). pre(1,2,2). pre(1,4,2).
pre(1,5,2). pre(3,2,2). pre(4,1,2). pre(4,2,2). pre(4,3,2). pre(4,4,2). pre(4,6,2). pre(1,5,3). pre(2,5,3). pre(4,1,3). pre(4,2,3).
pre(4,3,3). pre(4,4,3). pre(4,6,3). pre(1,1,4). pre(1,3,4). pre(1,4,4). pre(1,5,4). pre(1,6,4). pre(2,1,4). pre(2,3,4). pre(2,4,4).
pre(2,5,4). pre(2,6,4). pre(3,1,4). pre(3,3,4). pre(3,4,4). pre(3,5,4). pre(3,6,4).

% IDB consisting of constraints a feasible assembly sequence must satisfy.
p(1). p(2). p(3). p(4). a(X,Y) :- a(Y,X).

% the constraints that any part must be assembled at least once.
s(1,1) v s(1,2) v s(1,3) v s(1,4). s(2,1) v s(2,2) v s(2,3) v s(2,4).
s(3,1) v s(3,2) v s(3,3) v s(3,4). s(4,1) v s(4,2) v s(4,3) v s(4,4).

% the constraints that any part can be assembled only once.
:- s(Y,X), s(Z,X), Y!=Z.

% connectivity constraints.
:- s(2,X), ~bel(X,t1). :- s(3,X), ~bel(X,t2). :- s(4,X), ~bel(X,t3).
bel(X,t1) :- s(1,Y), a(X,Y). ~bel(X,t1) :- not bel(X,t1), p(X).
bel(X,t2) :- s(2,Y), a(X,Y). bel(X,t2) :- bel(X,t1). ~bel(X,t2) :- not bel(X,t2), p(X).
bel(X,t3) :- s(3,Y), a(X,Y). bel(X,t3) :- bel(X,t2). ~bel(X,t3) :- not bel(X,t3), p(X).

% precedence constraints.
:- s(2,X), prevent(a2,1,X), prevent(a2,2,X), prevent(a2,3,X), prevent(a2,4,X), prevent(a2,5,X), prevent(a2,6,X).
:- s(3,X), prevent(a3,1,X), prevent(a3,2,X), prevent(a3,3,X), prevent(a3,4,X), prevent(a3,5,X), prevent(a3,6,X).
:- s(4,X), prevent(a4,1,X), prevent(a4,2,X), prevent(a4,3,X), prevent(a4,4,X), prevent(a4,5,X), prevent(a4,6,X).
prevent(a2,D,X) :- pre(X,D,Y), bel(Y,a2). prevent(a3,D,X) :- pre(X,D,Y), bel(Y,a3).
prevent(a4,D,X) :- pre(X,D,Y), bel(Y,a4). bel(X,a2) :- s(1,X). bel(X,a3) :- s(2,X). bel(X,a3) :- bel(X,a2).
bel(X,a4) :- s(3,X). bel(X,a4) :- bel(X,a3).


Fig.  2.  A  DLV  program  for  generating  assembly  sequences  for  the  assembly  in  Fig.  1  with  IM 


Taking the program in Fig. 2 as input, DLV calculates 4 answer sets for the program:

{ s(1,1), s(2,2), s(3,3), s(4,4) }   { s(1,1), s(2,3), s(3,2), s(4,4) }

{ s(1,2), s(2,1), s(3,3), s(4,4) }   { s(1,3), s(2,1), s(3,2), s(4,4) }

Each answer set listed above corresponds to a feasible assembly sequence.
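The four sequences can be cross-checked without an answer set solver. The following Python sketch is a brute-force enumeration, not the ASP method of the paper: it tests every permutation of the four parts against the connectivity and precedence conditions, using the contact and interference facts of the EDB in Fig. 2. The table BLOCKED and the function feasible are hypothetical names; BLOCKED[(x, y)] collects the directions D for which pre(x, D, y) holds.

```python
from itertools import permutations

BASE = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}
CONTACTS = BASE | {(y, x) for (x, y) in BASE}

# BLOCKED[(x, y)]: directions in which part y prevents the motion of part x.
BLOCKED = {
    (2, 1): {1, 2, 5}, (3, 1): {2}, (4, 1): {1, 2, 3, 4, 6},
    (1, 2): {2, 4, 5}, (3, 2): {2}, (4, 2): {1, 2, 3, 4, 6},
    (1, 3): {5}, (2, 3): {5}, (4, 3): {1, 2, 3, 4, 6},
    (1, 4): {1, 3, 4, 5, 6}, (2, 4): {1, 3, 4, 5, 6}, (3, 4): {1, 3, 4, 5, 6},
}
ALL_DIRECTIONS = {1, 2, 3, 4, 5, 6}


def feasible(sequence):
    """Check connectivity and precedence for every step after the first."""
    for i in range(1, len(sequence)):
        part, earlier = sequence[i], sequence[:i]
        # Connectivity: the new part must touch an already assembled part.
        if not any((part, y) in CONTACTS for y in earlier):
            return False
        # Precedence: the new part must not be blocked in all six directions.
        blocked = set()
        for y in earlier:
            blocked |= BLOCKED.get((part, y), set())
        if blocked == ALL_DIRECTIONS:
            return False
    return True


feasible_sequences = [seq for seq in permutations([1, 2, 3, 4]) if feasible(seq)]
```

Each tuple lists the parts in assembly order, so (1, 3, 2, 4) corresponds to the answer set { s(1,1), s(2,3), s(3,2), s(4,4) }.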

If we want to change the design of the assembly by altering the size of some parts,
the contact and interference information may change. In this case we only need to
substitute the EDB with a new EDB and reuse the original IDB.

For example, if the connection information of the EDB is changed to

{ a(1,2). a(1,3). a(1,4). a(3,4). }

we only have to run the same IDB in Fig. 2 on this new EDB to find a feasible assembly
sequence.


4  Experiments 

We have written C programs which, given a connection graph and a translation function
for an assembly, automatically generate the ASP program for generating all the
assembly sequences for the assembly. Two experiments are performed. Firstly, SM
and IM are implemented in the DLV language, and their performance is tested on several
assemblies with various numbers of parts. The results are shown in the first four rows
of Table 2. Secondly, we compare the performance of IM implemented in the languages
of DLV, smodels, and cmodels, respectively. The results are shown in the
remaining rows of Table 2. In both experiments, we are interested in the time and
memory efficiency of the two methods. The computer we use has a Pentium(R) 4 CPU
at 3.00 GHz and 1.0 GB of memory. The memory requirements for the assemblies
are obtained by monitoring the Windows Task Manager, and the main concern is
whether the memory resources run out on an assembly.

Table 2. Performance of SM and IM, where the row time(s) in column m denotes the time in
seconds for generating one assembly sequence for the m-part assembly chosen for our
experiment, and time-out indicates that the runtime exceeds a limit of 5 hours

Number of parts in the assembly   4      6      8       11        14        15        16        20
SM  DLV      time(s)              0.59   1.03   202.2   time-out  time-out  time-out  time-out  time-out
             memory out           no     no     no      yes       yes       yes       yes       yes
IM  DLV      time(s)              0.02   0.02   0.05    0.48      3.81      3.3.5     0.39      time-out
             memory out           no     no     no      no        no        no        no        no
    smodels  time(s)              0.05   0.09   0.19    0.78      129.57    83.56     30.77     time-out
             memory out           no     no     no      no        no        no        no        no
    cmodels  time(s)              0.04   0.09   0.20    0.28      3.21      0.62      4.84      6.37
             memory out           no     no     no      no        no        no        no        no

The results of the first experiment show that SM implemented in the DLV language
quickly runs out of system memory, whereas IM can produce a feasible
assembly sequence in a short time on assemblies with no more than 16 parts.

The second experiment shows that DLV outperforms smodels on all experimental
assemblies, and cmodels is more efficient than DLV on assemblies with more than 10
parts, with the single exception of the 16-part assembly. In particular, when the number
of parts of an assembly reaches 20, neither DLV nor smodels can produce a feasible
sequence within 5 hours; cmodels, however, gives a solution in a fairly short time (6.37
seconds). This performance is acceptable for MASP since it is not time critical.

Since the space complexity of IM is polynomial, the above results suggest two
ways to apply IM to larger scale MASP problems. Using a fast answer set solver such
as cmodels is the most obvious candidate, but this method has limited applicability
since it can be seen that the time complexity of IM using cmodels is not linear. The
other way is to make use of subassembly identification techniques to divide an
assembly into several subassemblies, each of which has a small number of parts and
whose feasible assembly sequences can therefore be found quickly. The overall
assembly sequence can be obtained by concatenating the assembly sequences of the
individual subassemblies. The second way is very promising and has the potential to
make ASP based MASP methods competitive for industrial production processes.


5  Conclusion  and  Discussion 

This paper proposes an ASP based method for solving the NP-complete assembly
sequence planning problem, with the aim of improving the information reuse in this
process. The division of an ASP program into EDB and IDB provides a natural scheme
for this purpose. It is shown that once the IDB component is created for an assembly, it
can be reused/shared by all assemblies with the same number of parts. Compared with other
approaches for assembly sequence planning, this is a great advantage.

Experiments are conducted to test the performance of our methods by using different
answer set solvers. It is shown that cmodels outperforms DLV and smodels on
most non-trivial assemblies, and is a very promising tool for solving MASP problems.
In the future, we will integrate our method with existing CAD systems, which
will make it possible to measure the information reuse brought by our method in the
product development process.

Abstract. Manifold learning has been successfully used for finding dominant
factors (low-dimensional manifold) in a high-dimensional data set.
However, most existing manifold learning algorithms only consider one
manifold based on one dissimilarity matrix. For utilizing multiple manifolds,
a key question is how different pieces of information can be integrated
when multiple measurements are available. Amari proposed
α-integration for stochastic model integration, which is a generalized
averaging method that includes as special cases the arithmetic, geometric,
and harmonic averages. In this paper, we propose a new generalized
manifold integration algorithm equipped with α-integration, manifold
α-integration (MAI). Interestingly, MAI can be shown to be a generalization
of other integration methods (that may or may not use manifolds) like
kernel fusion or mixture of random walks. Our experimental results also
confirm that integration of multiple sources of information on individual
manifolds is superior to the use of individual manifolds separately, in
tasks including classification and sensorimotor integration.


1  Introduction 


In data analysis, it is important to understand the structure of the data, which
can be described as a manifold. Manifold learning involves inducing a smooth
nonlinear low-dimensional manifold from a set of data points drawn from the
manifold that is embedded in a high-dimensional space. Various manifold learning
methods have been developed and have drawn much attention in pattern
recognition and signal processing [1]. However, most existing manifold learning
algorithms only consider one manifold based on one dissimilarity (or distance)
matrix. Since different measurements generate data sets on different manifolds,
the resulting manifolds need to be integrated into one to use all the structural
information from different measurements in the framework of manifold learning.


B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 397–408, 2010

How can different measurements given by different distance matrices be used
together to form an integrated manifold? A key question here becomes how different
pieces of information can be integrated. In pattern recognition systems,
data integration has been an important issue to improve accuracy relative to a
single source of information because one sensor might not be good enough to provide
unambiguous information. Some algorithms have been applied to integrate
multiple sources of information (see [2] and references therein). However, each
integration algorithm works optimally only with specific types of data sets. A
more general approach, α-integration, was proposed by [3] for stochastic model
integration of multiple positive measures. It is a one-parameter family of integration,
where the single parameter α determines the characteristics of integration.
Given a number of stochastic models in the form of probability distributions,
it finds the optimal integration of the sources in the sense of minimizing the
α-divergence [3].


Fig. 1. An example of manifold integration when two manifolds are available from one
data set. The two manifolds are from different measurements (color or size), and each,
taken separately, is not suitable for understanding the data set perfectly. However, we
can integrate the two measurements to obtain one integrated manifold which gives a
complete picture of the data set.


Motivated by advances in manifold learning and data integration, in this paper
we propose a new manifold integration algorithm, manifold α-integration
(MAI), that combines the manifold learning and data integration approaches as
in Fig. 1. We show that our method includes as special cases previous methods
such as the use of statistical distance [4], kernel-based data fusion [5] or mixture
of random walks [6], by analyzing the compromised distances on the integrated
manifold. Our experimental results with four data sets, including real-world data
sets, are promising. Notably, we show that MAI can be applied to sensorimotor
integration (cf. [7]).


2 Review of α-Integration


First, we provide a brief overview of α-integration, more details on which can
be found in [3]. One exemplary application of α-integration can be found in [8],
where α-integration successfully generalizes evidence theory. Let us consider two
positive measures of random variable x, denoted by $m_i(x) > 0$
for i = 1, 2. The α-mean [3] is a one-parameter family of means, defined by

$$\tilde{m}_\alpha(x) = f_\alpha^{-1}\left(\frac{1}{2}\left\{f_\alpha(m_1(x)) + f_\alpha(m_2(x))\right\}\right), \tag{1}$$

where $f_\alpha(\cdot)$ is a differentiable monotonic function given by

$$f_\alpha(z) = \begin{cases} z^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\ \log z, & \alpha = 1. \end{cases} \tag{2}$$

The function $f_\alpha(\cdot)$ in Eq. (2) is the only function that enables the α-mean to be
linear scale free: for c > 0, the α-mean of $c\,m_1(x)$ and $c\,m_2(x)$ is $c\,\tilde{m}_\alpha(x)$, since

$$c\,\tilde{m}_\alpha(x) = f_\alpha^{-1}\left(\frac{1}{2}\left\{f_\alpha(c\,m_1(x)) + f_\alpha(c\,m_2(x))\right\}\right). \tag{3}$$

The α-mean includes various commonly used means as special cases: for
α = −1, 1, 3, ∞ or −∞, the α-mean becomes the arithmetic mean, geometric mean,
harmonic mean, minimum, or maximum, respectively. The value of the parameter α
(which is usually specified in advance and fixed) reflects the characteristics of the
integration. As α increases, the α-mean resorts more to the smaller of $m_1(x)$ and $m_2(x)$,
while as α decreases, the larger of the two is considered with more weight [3].
The α-mean can be generalized to the weighted α-mixture of M positive measures
$m_1(x), \ldots, m_M(x)$ with weights $w = [w_1, \ldots, w_M]$, which is referred to as
α-integration of $m_1(x), \ldots, m_M(x)$ with weights w [3].


Definition 1 (α-integration). The α-integration of $m_i(x)$, i = 1, ..., M, with
weights $w_i$ is defined by

$$\tilde{m}(x) = f_\alpha^{-1}\left(\sum_{i=1}^{M} w_i f_\alpha(m_i(x))\right), \tag{4}$$

where $w_i > 0$ for i = 1, ..., M and $\sum_{i=1}^{M} w_i = 1$.
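The special cases of the α-mean can be checked numerically. The snippet below is a minimal sketch for scalar measure values, assuming $f_\alpha(z) = z^{(1-\alpha)/2}$ for α ≠ 1 and $\log z$ for α = 1; the function names are hypothetical.

```python
import math


def f_alpha(z, alpha):
    """The function f_alpha for a positive scalar z."""
    return math.log(z) if alpha == 1 else z ** ((1 - alpha) / 2)


def f_alpha_inv(y, alpha):
    """Inverse of f_alpha."""
    return math.exp(y) if alpha == 1 else y ** (2 / (1 - alpha))


def alpha_integration(measures, weights, alpha):
    """Weighted alpha-integration of positive values; weights are assumed to sum to 1."""
    s = sum(w * f_alpha(m, alpha) for m, w in zip(measures, weights))
    return f_alpha_inv(s, alpha)


def alpha_mean(m1, m2, alpha):
    """Alpha-mean of two values: alpha-integration with equal weights."""
    return alpha_integration([m1, m2], [0.5, 0.5], alpha)


arithmetic = alpha_mean(2.0, 8.0, -1)   # (2 + 8) / 2
geometric = alpha_mean(2.0, 8.0, 1)     # sqrt(2 * 8)
harmonic = alpha_mean(2.0, 8.0, 3)      # 2 / (1/2 + 1/8)
```

Large positive α pushes the result toward the smaller value and large negative α toward the larger one, which is the limiting behaviour described above.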

Given M positive measures $m_i(x)$, i = 1, ..., M, the goal of integration is
to seek their weighted average $\tilde{m}(x)$ that is as close to each of the measures as
possible, where how close two positive measures are is evaluated using a divergence
measure. It was shown by [3] that the α-integration $\tilde{m}(x)$ is optimal in the
sense that the risk function

$$J_\alpha[\tilde{m}(x)] = \sum_{i=1}^{M} w_i D_\alpha[m_i(x) \,\|\, \tilde{m}(x)] \tag{5}$$

is minimized, where $D_\alpha[m_i(x) \,\|\, \tilde{m}(x)]$ is the α-divergence of $\tilde{m}(x)$ from the
measures $m_i(x)$ [3].


3 Manifold α-Integration

In this section, we propose a new manifold integration method using α-integration,
which leads to manifold α-integration (MAI), and we show that it includes previous
integration methods as special cases.


3.1 Algorithm: MAI


Let G be a weighted graph with N nodes, representing a manifold. Then, the
distance between two nodes on the kth manifold, $D_{ij}^{(k)}$, can be transformed into
probability $P_{ij}^{(k)}$, the transition probability from the ith node to the jth node
on the kth manifold. We simply use the Gaussian kernel, which is given by

$$P_{ij}^{(k)} = \frac{1}{Z_i^{(k)}} \exp\left(-\frac{\left(D_{ij}^{(k)}\right)^2}{\left(\sigma^{(k)}\right)^2}\right), \tag{6}$$

where $Z_i^{(k)}$ is a normalization term so that the sum of transition probabilities
from the ith node to all other nodes on the kth manifold is 1, and $\sigma^{(k)}$ is a
parameter representing the standard deviation. Given C dissimilarity matrices,
$D^{(1)}, D^{(2)}, \ldots, D^{(C)}$, we can get C probability matrices, $P^{(1)}, P^{(2)}, \ldots, P^{(C)}$.
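The conversion from distances to transition probabilities can be sketched as follows. The sketch assumes that the normalization runs over all other nodes (j ≠ i), so self-transitions are set to zero; the function name is hypothetical.

```python
import math


def transition_probabilities(D, sigma):
    """Row-normalised Gaussian kernel of a distance matrix D (list of lists)."""
    n = len(D)
    P = []
    for i in range(n):
        # Unnormalised kernel values; the self-transition is excluded (assumption).
        row = [0.0 if j == i else math.exp(-(D[i][j] ** 2) / sigma ** 2) for j in range(n)]
        z = sum(row)  # normalization term Z_i
        P.append([v / z for v in row])
    return P


# Three points on a line at positions 0, 1 and 3.
D_example = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
P_example = transition_probabilities(D_example, sigma=1.0)
```

Nearby nodes receive most of the probability mass: from the first node, the transition to the node at distance 1 is far more likely than the transition to the node at distance 3.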

There are two approaches in using α-integration on multiple manifolds: (1)
using transition probability matrices and (2) using distance matrices. First, the
transition probability from the ith node to the jth node on the kth manifold is
given by Eq. (6). So, given $P^{(1)}, P^{(2)}, \ldots, P^{(C)}$, with α-integration, the compromised
probability is given by

$$P_{\alpha,ij} = \frac{1}{Z_{\alpha,i}} f_\alpha^{-1}\left(\sum_{k=1}^{C} w_k f_\alpha\left(P_{ij}^{(k)}\right)\right), \tag{7}$$

where $Z_{\alpha,i}$ is a normalization term. From the compromised probability $P_\alpha$, we
can reconstruct the compromised dissimilarity $D_{\alpha p}$ as follows:

$$D_{\alpha p,ij} = \sigma^* \sqrt{-\log\left(P_{\alpha,ij}\right)}, \tag{8}$$

where $\sigma^*$ is the average of $\sigma^{(k)}$, k = 1, ..., C.
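The probability-based variant can be sketched in a few lines. The code below assumes that Eq. (8) inverts the Gaussian kernel up to the normalization term, i.e. that the distance is $\sigma^*$ times the square root of the negative log-probability, and that self-transitions are excluded; both are assumptions of this sketch, and all function names are hypothetical.

```python
import math


def f_alpha(z, alpha):
    return math.log(z) if alpha == 1 else z ** ((1 - alpha) / 2)


def f_alpha_inv(y, alpha):
    return math.exp(y) if alpha == 1 else y ** (2 / (1 - alpha))


def integrate_probabilities(P_list, weights, alpha):
    """Alpha-integrate C transition matrices entry-wise, then normalise each row."""
    n = len(P_list[0])
    P = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)  # no self-transition (assumption)
            else:
                s = sum(w * f_alpha(Pk[i][j], alpha) for Pk, w in zip(P_list, weights))
                row.append(f_alpha_inv(s, alpha))
        z = sum(row)  # normalization term
        P.append([v / z for v in row])
    return P


def compromised_distance(P, sigma_star):
    """Distance reconstructed from probabilities, assumed form sigma* * sqrt(-log P)."""
    n = len(P)
    return [[0.0 if i == j else sigma_star * math.sqrt(-math.log(P[i][j]))
             for j in range(n)] for i in range(n)]


P1 = [[0.0, 0.5, 0.5], [0.2, 0.0, 0.8], [0.6, 0.4, 0.0]]
P2 = [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5], [0.3, 0.7, 0.0]]
P_mix = integrate_probabilities([P1, P2], [0.5, 0.5], alpha=-1)
D_mix = compromised_distance(P_mix, sigma_star=1.0)
```

With α = −1 the integration reduces to a weighted average of the source matrices, and a higher compromised probability translates into a smaller compromised distance.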

Then we use kernel Isomap [9]. Given a distance matrix $D_{\alpha p}$, we substitute $\tilde{D}_{\alpha p}$
for $D_{\alpha p}$, which is given by

$$\tilde{D}_{\alpha p,ij} = D_{\alpha p,ij} + c\,(1 - \delta_{ij}), \tag{9}$$

where $\delta_{ij}$ is the Kronecker delta. Here, c is the solution of the constant-shifting
method [10] to make the doubly centered kernel matrix $K = -\frac{1}{2} H \tilde{D}_{\alpha p}^2 H$
positive semi-definite. Here, $\tilde{D}_{\alpha p}^2$ is the element-wise square of $\tilde{D}_{\alpha p}$ and
$H = I - \frac{1}{N} e_N e_N^\top$, where $e_N = [1 \ldots 1]^\top \in \mathbb{R}^N$. Finally, after eigen-decomposition,
$K = V \Lambda V^\top$, the projection mapping Y is given by

$$Y = V \Lambda^{\frac{1}{2}}. \tag{10}$$
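The shifting and double-centering steps can be illustrated without a linear algebra library. The sketch below covers only those two steps: the constant c is taken as a given input rather than computed by the constant-shifting method, and the eigen-decomposition is omitted. The function names are hypothetical.

```python
import math


def shift_off_diagonal(D, c):
    """Add the constant c to every off-diagonal distance, leaving the diagonal at zero."""
    n = len(D)
    return [[D[i][j] + (0.0 if i == j else c) for j in range(n)] for i in range(n)]


def double_center(D):
    """K = -1/2 * H * D2 * H, with D2 the element-wise square and H = I - (1/N) e e^T."""
    n = len(D)
    D2 = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in D2]
    col_mean = [sum(D2[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (D2[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


# Euclidean distances between three points on a line at positions 0, 1 and 3.
x = [0.0, 1.0, 3.0]
D_line = [[abs(xi - xj) for xj in x] for xi in x]
K_line = double_center(D_line)
```

For Euclidean distances the result is the Gram matrix of the mean-centered points, so it is already positive semi-definite and no shift is needed; the shift matters when the compromised distances are not Euclidean.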

A more interesting approach is to apply α-integration to the distance matrices directly.
Given C dissimilarity matrices, $D^{(1)}, D^{(2)}, \ldots, D^{(C)}$, we can reconstruct
the compromised dissimilarity $D_{\alpha d}$ directly by α-integration without considering
the transition probability. It is given by

$$D_{\alpha d,ij} = f_\alpha^{-1}\left(\sum_{k=1}^{C} w_k f_\alpha\left(D_{ij}^{(k)}\right)\right). \tag{11}$$

Note that Eq. (7) is an α-integration of the probabilities, while Eq. (11) is
an α-integration of the distances. So, with Eq. (7), the compromised distance
in Eq. (8) is different from that of Eq. (11). We derive two slightly different
manifold integration methods from these two different integration approaches,
and call the two versions MAI$_p$ and MAI$_d$, respectively. After getting $D_\alpha$ (either
$D_{\alpha p}$ or $D_{\alpha d}$), the rest is the same as kernel Isomap [9], by which our method
inherits the dimensionality reduction property. Note again that since MAI uses
kernel Isomap after obtaining the compromised distance matrix, it inherits the
projection property of kernel Isomap, which involves the projection of novel data
points onto the associated low-dimensional space. Due to limited space, we do not
derive the equations for the projection here. The derivations for the projection
property can be found in [9].

3.2  Comparison  with  Existing  Data  Integration  Approaches 

We analyze MAI$_p$ and MAI$_d$, comparing them to previous methods [4,5,6] even
though some of them are not immediately about manifold integration.

Case 1: In random walk on multiple manifolds (RAMS) [4], the compromised
transition probability matrix $P^*$ is simply given by multiplication of the source
probabilities. Approximately, this is a special case of MAI$_p$ in Eq. (7) with α = 1
and uniform weights for all manifolds:

$$P_{1,ij} = \frac{1}{Z_{1,i}} \prod_{k=1}^{C} \left(P_{ij}^{(k)}\right)^{\frac{1}{C}}, \tag{12}$$

where $Z_{1,i}$ is a normalization term. Then, the compromised distance in Eq. (8)
is reconstructed by

$$D_{1p,ij} = \sigma^* \sqrt{\log Z_{1,i} - \frac{1}{C} \sum_{k=1}^{C} \log P_{ij}^{(k)}}, \tag{13}$$

which is almost the same as the compromised distance in RAMS except for the
normalization term and the factor $\frac{1}{C}$. That is, RAMS is approximately a special case of
MAI$_p$ with α = 1 and uniform weights. If we relax the assumption on the weights
so that the sum of weights is not 1 but $\sum_{k=1}^{C} w_k = C$, then MAI$_p$ leads to
exactly the same result as RAMS.

Now, we can check the case when α-integration is applied to the distance
matrices directly (MAI$_d$). When α = −3 and the weights are given by $w_k$, the
compromised distance matrix of MAI$_d$ in Eq. (11) is given by

$$D_{-3d,ij} = \sqrt{\sum_{k=1}^{C} w_k \left(D_{ij}^{(k)}\right)^2}, \tag{14}$$

which is the same as the compromised distance matrix $D^*$ in RAMS except for
the normalization terms. Here, the weights $w_k$ can serve as normalization terms
for different units across measurements.


Case 2: [5] used a weighted sum of kernel matrices for kernel-based data fusion.
For the special case when α = −3 and weight w_k for the k-th manifold applied
to MAId, the corresponding kernel matrix is given by just the weighted
average of the kernel matrices as follows:

$$K_{\mathrm{MAI}_d} = -\frac{1}{2} H D^{*2} H = \sum_{k} w_k K^{(k)}, \qquad (15)$$

where $K^{(k)} = -\frac{1}{2} H D^{(k)2} H$. Notice that the last term in Eq. (15) is the
kernel-based data fusion proposed in [5], which can now be seen as a special case of
MAId. It was shown that manipulating the distance matrix gives a better result
than manipulating the kernel matrix directly [9]. In other words, the integrated
space of MAId can be better than (or at least equal to) that of the kernel-based data
fusion methods when the α value is carefully chosen.
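
The equality in Eq. (15) rests on the linearity of double centering. The following minimal sketch checks this numerically on toy data. It assumes that H is the usual centering matrix H = I − (1/N)11ᵀ; the function names and the toy matrices are hypothetical and chosen only for illustration.

```python
def double_center(sq_dist):
    """Return K = -1/2 * H * D2 * H for a square matrix of squared distances.

    Assumes H = I - (1/N) * ones * ones^T, so that
    (H D2 H)_ij = D2_ij - row_mean_i - col_mean_j + grand_mean.
    """
    n = len(sq_dist)
    row_mean = [sum(row) / n for row in sq_dist]
    col_mean = [sum(sq_dist[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq_dist[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def weighted_sum(matrices, weights):
    """Element-wise weighted sum of equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights))
             for j in range(n)] for i in range(n)]


# Toy squared-distance matrices for two manifolds (hypothetical values).
D1 = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
D2 = [[0.0, 9.0, 1.0], [9.0, 0.0, 4.0], [1.0, 4.0, 0.0]]
weights = [0.3, 0.7]

# Left-hand side: center the weighted sum of squared distances.
kernel_of_sum = double_center(weighted_sum([D1, D2], weights))
# Right-hand side: weighted sum of the individually centered kernels.
sum_of_kernels = weighted_sum([double_center(D1), double_center(D2)], weights)

max_gap = max(abs(a - b)
              for row_a, row_b in zip(kernel_of_sum, sum_of_kernels)
              for a, b in zip(row_a, row_b))
```

Up to floating-point rounding, `max_gap` is zero, so averaging squared distances and then centering gives the same kernel as averaging the per-manifold kernels.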

Case 3: Also, in [6], even though manifold integration was not discussed
directly, a mixture of random walks was used as an integration method. With
α = −1 and different weights w_k for the k-th manifold, MAIp has

$$P^{*}_{ij} = \frac{1}{Z_i} \sum_{k} w_k P^{(k)}_{ij}, \qquad (16)$$

which  is  a  mixture  of  random  walks. 

In sum, we checked three previous methods for data integration and compared
them with our two proposed approaches. The previous approaches all turn out
to be (approximately) special cases of our proposed method.
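
To make the role of α in the three cases concrete, the following sketch computes a weighted α-mean of positive numbers. It assumes the standard α-mean of [3], with f_α(x) = x^((1−α)/2) for α ≠ 1 and f_α(x) = log x for α = 1; the defining equations of MAI are not restated in this section, so the function below is an illustration rather than the exact procedure used here, and the name `alpha_mean` is hypothetical.

```python
import math


def alpha_mean(values, weights, alpha):
    """Weighted alpha-mean of positive numbers (weights assumed to sum to 1).

    alpha = -1 gives the arithmetic mean, alpha = 1 the geometric mean,
    alpha = 3 the harmonic mean, and alpha = -3 the quadratic mean.
    """
    if alpha == 1:
        # f(x) = log x, so the mean is the weighted geometric mean.
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)))
    p = (1.0 - alpha) / 2.0
    # f(x) = x**p; apply f, average, then invert f.
    return sum(w * v ** p for v, w in zip(values, weights)) ** (1.0 / p)


values = [1.0, 4.0]
weights = [0.5, 0.5]

arithmetic = alpha_mean(values, weights, -1)   # mixture-like averaging
geometric = alpha_mean(values, weights, 1)     # product-like averaging
harmonic = alpha_mean(values, weights, 3)
quadratic = alpha_mean(values, weights, -3)    # averages the squared values
```

With α = −3 the squares of the inputs are averaged, which matches the weighted average of squared distances in Case 2, while α = −1 and α = 1 correspond to the mixture and the product of transition probabilities in Cases 3 and 1, respectively.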


4  Experiments  and  Results 

In order to show the effectiveness of our method, we carried out experiments
with four different data sets: (1) a disc data set made of 100 discs with different
colors and sizes [4]; (2) head-related impulse response (HRIR) data [11]; (3) the
CMU ARCTIC speech database [12]; and (4) sensorimotor integration.

4.1 Disc Data

We used an artificial disc data set to show the differences between the three
methods: (1) RAMS, (2) MAI on transition probability matrices (MAIp), and
(3) MAI on distance matrices (MAId). Let X ∈ R^{2×100} be the discs' locations.
The first row X_1 and the second row X_2 are the coordinates for color and size,
respectively. From this disc data set, each distance matrix is obtained from
color only or from size only, and squared as follows:

$$D^{(k)}_{ij} = \mathrm{dist}(X_{k,i}, X_{k,j})^2.$$
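
As a minimal sketch of this construction, the code below builds one squared-distance matrix per attribute from a small 2×N coordinate array. It assumes that dist is the absolute difference of the scalar coordinates; the three toy discs and all names are hypothetical.

```python
def squared_distance_matrix(coords):
    """Squared pairwise distances between scalar coordinates."""
    return [[(a - b) ** 2 for b in coords] for a in coords]


# Toy stand-in for X: row 0 holds color coordinates, row 1 holds size coordinates.
X = [[0.0, 1.0, 3.0],
     [0.0, 2.0, 2.0]]

D_color = squared_distance_matrix(X[0])   # uses color only
D_size = squared_distance_matrix(X[1])    # uses size only

# Adding the two matrices gives the squared Euclidean distance in the
# original two-dimensional (color, size) space.
n = len(X[0])
D_sum = [[D_color[i][j] + D_size[i][j] for j in range(n)] for i in range(n)]
```

Each matrix sees only one attribute, which is why neither alone can recover the two-dimensional layout of the discs.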

Fig. 2 shows the data set and three integrated spaces from the three methods. If
we use α = 1 and α = −3 for MAIp and MAId, respectively, then the results of
MAI are the same as RAMS. Here, we chose the α values to be 0.89 and −1.25 for
MAIp and MAId, respectively. Note that the integrated spaces from MAI have
almost a square shape, as they should, while RAMS has a
fat square even though it found a “properly” integrated space where color and
size are two dominant coordinates. In addition, MAId found almost the same
result as the original set, while MAIp has slightly denser dots around the center
of the space. Even though MAIp generalizes RAMS with the same transition
probability matrices, the reconstructed distance matrix is not optimal in the
α-integration sense, because the transition probability in Eq. (6) is
combined into f_α, which is no longer a linear, scale-free function of distance.
This may be why MAIp has a little distortion in the integrated space, even though
it is still better than RAMS.


Fig. 2. Disc data set (a) and three integrated spaces (b-d): (a) original data set, 100
discs, (b) RAMS, (c) MAIp with α = 0.89, and (d) MAId with α = −1.25

4.2  HRIR  Data 

In this experiment, we used the public-domain CIPIC HRTF data set [11] and
applied kernel Isomap to each ear's HRIR data to generate a 2-dimensional
manifold for each ear. Then we applied MAId to integrate the two manifolds. A
detailed description of the HRIR data sets can be found in [11]. We mainly pay
attention to the HRIRs involving sound sources specified by different elevation
angles. The database contains HRIRs sampled at 1250 points around the head
for 45 subjects. Azimuth is sampled from −80° to 80° and elevation from −45°
to 230.625°. Each HRIR is a 200-dimensional vector corresponding to a duration
of about 4.5 ms.


Fig. 3. Embedded manifolds of (a) left HRIR, (b) right HRIR, and (c) integrated HRIR.
Even though the left HRIR is seriously distorted and the right one is also not smooth,
the integrated space shows a very smooth, low error result, due to the use of both
pieces of information.


Fig. 3 shows the performance of our method MAId with α = −0.5 on the 20th
subject in the data set. MAId was applied to the distance matrices of (a) and
(b). Neither (a) nor (b) alone is perfect for locating the sound source. The integrated
result locates the sound source better than either of the two results considered
separately. Note that the embedded manifolds in Fig. 3 have some ambiguities
like up-down or front-back.

4.3  Speech  Data 

We carried out numerical experiments with the CMU ARCTIC speech database
[12] in order to show an integrated manifold from multiple manifolds which leads
to a speaker independent phoneme space and to show the benefit of the integrated
space for phoneme classification. The CMU ARCTIC database was constructed
as a phonetically balanced, US English single speaker database designed
for unit selection speech synthesis research. The database includes US English
male (‘bdl’, ‘rms’) and female (‘slt’, ‘clb’) speakers, each speaking a set of
sentences. From the sentences, we extracted four phonemes, ‘AH’, ‘EH’, ‘IH’ and
‘OW’, for each speaker and converted each phoneme into Mel frequency cepstral
coefficients (MFCCs), which served as our feature vectors.

Speaker independent phoneme space. First, we found one map of these
vowels from four speakers' four vowels, where each phoneme consisted of 300
sample data points. Fig. 4 shows two speakers' individual maps from kernel
Isomap. Even though they pronounced the same phonemes, their maps are
different from each other. Furthermore, the clusters of phonemes are not well
separated even in each map, since each map represents both linguistic information
and speaker dependent information.

On the other hand, Fig. 5a is the integrated map from four speakers' maps,
and it shows well defined clusters of phonemes, which means that this map
represents the phoneme information but not speaker dependent information.

Fig. 4. Individual mapping of (a) speaker ‘bdl’ and (b) speaker ‘slt’ using kernel
Isomap. The two maps look different because they are from different speakers even though
they are for the same set of phonemes.


Classification. After getting the integrated map of phonemes, we tried to use
this map for classification. We tested it with 6 different training data sizes. For
each speaker's individual phoneme, we randomly selected 50, 100, 150, 200, 250,
and 300 samples for training and the rest for testing. For each trial, we repeated
the experiment 30 times with randomly chosen data points and averaged the results.
Fig. 5b shows the classification results with the quadratic classifier. From this
figure, we can see that other speakers' information is helpful for phoneme
classification as long as the training data set is larger than a certain size. The average
classification rate for individual speaker data converges to 73.8% when 300
phonemes are used for training data, whereas MAId, especially with α = 3,
reaches 76.4%. Note that the performance changes as the α value changes and
we can pick the best one to get better results. Here, α = 3 is the best among
integer values for α, which might be supported by the maximum likelihood estimate
(MLE). If we assume that each measurement has a Gaussian distribution, which
is almost the case for each phoneme in Fig. 5a, the MLE of the variance for all
measurements is the harmonic mean of the individual variances. In this
classification task, the best α value for the integration of distances might be explained
in the same way as the best α value for the variances. For more discussion about
selecting the α values, see [13].

(a) Integrated map (b) Classification (horizontal axis: amount of training data for each phoneme)

Fig. 5. Integrated manifold. (a) The map through MAId with α = 3, where the
phonemes are better clustered within class and separated from each other across classes.
(b) Hit rates for MAId (squares, triangles and circles) and individual map (crosses).
After around 50 phonemes for training, MAId becomes better than the individual map.
As shown above, RAMS and kernel-based data fusion are (approximately) special cases
of MAId with α = −3 (squares).

In Fig. 5b, however, when the training data size is smaller than around 80
phonemes, our proposed method is slightly worse than the individual-based map.
The intersection is somewhere between 50 and 100. This phenomenon might be
explained as follows. When the training data set is small, it is not enough to
represent the real phoneme space. So, the test points could have been projected
into a distorted map induced by the other speakers' information. But when the
training data set is large enough, the projected space is more likely to represent the
real phoneme space. So, the speaker dependent noise is removed from the test
points, which leads to better classification.


4.4  Sensorimotor  Integration 

To apply MAI to sensorimotor integration [7], we simulated sensory and motor
information as shown in Fig. 6. The sensory information in (b), mimicking a
non-linear transformation (e.g., log-polar transform) in the visual system, is a
distorted version of the true square map in (a). The motor information in (c)
is based on two angles of an articulated arm to reach locations within the true
coordinate. The arm consisted of two sticks of the same length. Given a point (t_x, t_y)
in the true map with t_x ∈ [0, 2] and t_y ∈ [0, 2], the corresponding point in the
sensory map (s_x, s_y) was given by


(  £>x  —  L  0.05, 

\  $y  ~  y/'ty  +  0.05. 


With the two sticks a_1 and a_2 pointing to (t_x, t_y) with the end of a_2, the
corresponding point in the motor map (m_x, m_y) was calculated by

$$\begin{cases} m_x = \text{angle between } a_1 \text{ and the horizontal axis}, \\ m_y = \text{angle between } a_1 \text{ and } a_2, \end{cases}$$

where the angles are in radians.
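
A minimal sketch of this motor map for a planar two-link arm follows. Several details are assumptions not given in the text: the arm is anchored at the origin, both sticks have length 1.5 (long enough to reach every point in [0, 2] × [0, 2]), one of the two possible elbow configurations is chosen, and m_y is taken as the interior angle at the elbow. The function names are hypothetical.

```python
import math


def motor_map(tx, ty, link_length=1.5):
    """Map a reachable target (tx, ty) to the two arm angles (m_x, m_y) in radians.

    link_length is a hypothetical value; the text only states that both
    sticks have the same length.
    """
    r = math.hypot(tx, ty)                     # distance from the arm base (origin)
    if r > 2 * link_length:
        raise ValueError("target is out of reach")
    gamma = math.atan2(ty, tx)                 # direction of the target
    beta = math.acos(r / (2 * link_length))    # base angle of the isosceles triangle
    m_x = gamma + beta                         # angle between a1 and the horizontal axis
    m_y = math.pi - 2 * beta                   # interior angle between a1 and a2
    return m_x, m_y


def end_point(m_x, m_y, link_length=1.5):
    """Forward kinematics: recover the end of a2 from the two angles."""
    a2_direction = m_x - (math.pi - m_y)
    x = link_length * (math.cos(m_x) + math.cos(a2_direction))
    y = link_length * (math.sin(m_x) + math.sin(a2_direction))
    return x, y


m_x, m_y = motor_map(2.0, 0.0)
recovered = end_point(m_x, m_y)
```

Applying `end_point` to the output of `motor_map` returns the original target, which confirms that the pair of angles encodes the location without loss under these assumptions.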

In sensorimotor integration, two different kinds of information can be converted into a
common representation (or integrated manifold in our words) for fusion, which is
referred to as coordinate transformation [14]. [14] suggested maximizing mutual
information between sensory information and motor information on the common
representation while preserving topographic order to find two different mapping
functions without considering structural information inherent in the data set,
while we can simply obtain two such mapping rules from MAI, based on the
structural information from the integrated manifold.

Fig. 6. Sensorimotor integration: (a) true map and a straight line (reference), (b)
sensory map and the projection of the reference line, (c) motor map and the projection
of the reference line, and (d) an integrated map and the two projections of the reference
line from the sensory and the motor space, respectively. We used MAId with α = 3.
The circles with the same color in the 4 maps represent the same location.

For example, when we draw a straight line on the true map, we get two kinds
of information at the same time: sensory and motor, as the curves on the two
maps (b) and (c). Though we cannot directly compare the two curves on the two
different maps, with MAI we can project the two curves onto one integrated map
and compare them as in (d). The blue curves are from the sensory space and
the red curves from the motor space in (b), (c) and (d). In (d), the two curves
closely overlap, though they are not perfectly the same because the maps are
not perfect. This way, we can compare the sensory and the motor information
directly on the integrated manifold, which gives a common representation.


5  Conclusion 

In this paper, we proposed a generalized manifold integration method utilizing
α-integration, which led to MAI. MAI integrates multiple measurements, each of
which is assumed to lie on a separate manifold. We showed that MAI includes as
its special cases previous methods such as RAMS, kernel-based data fusion,
and mixture of random walks. Furthermore, it can generalize to other integrated
spaces in as many different ways as we want by varying the α value. The
experimental results confirmed that MAI integrates multiple measurements into one
manifold in an effective manner, helping us to better understand the data set.
For example, when we applied MAI to real world data sets, it found a better
manifold than the individual manifolds. In classification tasks, the integrated
manifold generally improved the accuracy when the training data set was
reasonably large. Also, MAI was successfully applied to sensorimotor integration.
The main contributions of this paper are as follows: (1) derivation of a
generalized manifold integration algorithm and (2) showing that manifold integration
is useful for many potential problems. We expect our results to serve as an
effective framework for analyzing multimodal data sets on multiple manifolds. In
this paper, we reconstructed the integrated space assuming that the α value is
manually chosen. In our future work, we intend to develop ways to find the α
value automatically, optimized for the specific task.

Acknowledgments. This work was supported by Korea NRF Converging Research
Center Program (No. 2009-0093714), NIPA ITRC support program (NIPA-2010-C1090-
1031-0009), and NRF WCU Program (Project No. R31-2008-000-10100-0).


References 

1. Seung, H.S., Lee, D.D.: The manifold ways of perception. Science 290, 2268-2269
(2000)

2. Hall, D.L., Llinas, J.: An introduction to multisensor data fusion. Proceedings of
the IEEE 85(1), 369-376 (1997)

3. Amari, S.: Integration of stochastic models by minimizing α-divergence. Neural
Computation 19, 2780-2796 (2007)

4. Choi, H., Choi, S., Choe, Y.: Manifold integration with Markov random walks. In:
Proc. Association for the Advancement of Artificial Intelligence (AAAI), Chicago,
IL, vol. 1, pp. 424-429 (2008)

5. Lanckriet, G.R.G., Deng, M., Cristianini, N., Jordan, M.I., Noble, W.S.: Kernel-
based data fusion and its application to protein function prediction in yeast. In:
Proc. Pacific Symposium on Biocomputing (PSB), Big Island, HI, vol. 9, pp. 300-311
(2004)

6. Zhou, D., Burges, C.: Spectral clustering and transductive learning with multiple
views. In: Proc. Int’l Conf. Machine Learning, pp. 1159-1166 (2007)

7. Todorov, E.: Optimality principles in sensorimotor control. Nature Neuroscience
7(9), 907-915 (2004)

8. Choi, H., Katake, A., Choi, S., Choe, Y.: Alpha-integration of multiple evidence.
In: Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing, Dallas, TX,
pp. 2210-2213 (2010)

9. Choi, H., Choi, S.: Robust kernel Isomap. Pattern Recognition 40(3), 853-862
(2007)

10. Cailliez, F.: The analytical solution of the additive constant problem.
Psychometrika 48(2), 305-308 (1983)

11. Algazi, V.R., Duda, R.O., Thompson, D.M., Avendano, C.: The CIPIC HRTF
database. In: Proc. 2001 IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics, pp. 99-102 (2001)

12. Kominek, J., Black, A.W.: CMU ARCTIC databases for speech synthesis (2003)

13. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning alpha-integration with partially-
labeled data. In: Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing,
Dallas, TX, pp. 2058-2061 (2010)

14. Ghahramani, Z., Wolpert, D.M., Jordan, M.I.: Computational models of
sensorimotor integration. Science 269, 1880-1882 (1997)


Ranking  Entities  Similar  to  an  Entity 
for  a  Given  Relationship 


Yong-Jin Han1, Seong-Bae Park1, Sang-Jo Lee1, So Young Park1,*,
and Kweon Yang Kim2

1 School of Electrical Engineering and Computer Science,
Kyungpook National University,
Daegu 702-701, Korea
{yjhan,sbpark,sjlee,sypark}@sejong.knu.ac.kr

2 School of Computer Engineering, Kyungil University,
Gyeongsan 712-701, Korea
kykim@kiu.ac.kr


Abstract. This paper proposes a similarity ranking method for entities
in the real world. Real world entities like people or objects often
have some relationship between themselves. Finding such relationships
from real world data can greatly enhance recognition of real world
situations. However, it is difficult to capture such relationships from real
world sensors alone. Nowadays, activities of people are often shared via
the Web. The activities can be represented as a relationship between people
with shared items such as books, movies or other items. In Semantic
Web research, such relational information has been modeled in ontologies.
The ranking method proposed in this paper finds
meaningful relationships between entities in ontologies. In the first step,
the method discovers pairs of entities which have meaningful connections
in an ontology. Then it ranks the pairs according to similarities between
entities. Unlike previous work, the proposed method considers not only
instance level connections, but also ontology schema level connections.
This approach enables machines to incorporate previously hidden indirect
relationships into the similarity rankings. The experiments using an existing
people-experience ontology show that the proposed method outperforms
previous methods.

Keywords:  Ranking  entities,  Ranking  method,  Semantic  Association, 
Relationship. 


1  Introduction 

A machine can recognize people or other objects by using various machine
learning methods and intelligent sensors. For example, an artificial neural network has
been used to recognize people or objects by adapting its weight parameters with
labeled voice data or labeled image data. However, the recognized entities are
still fragmentary knowledge and they only represent the identity of the person
or objects. In the real world, almost every entity is related to other entities.

Comprehensive knowledge can be derived from such relationships between
entities. To acquire a deeper understanding of “contexts”, modeling and finding
such relationships are very important. For example, let's assume that two
people have been identified in the same room, and some smart sensors have
recognized audio and video signals that could be understood as “task related with
fire”. What are they actually doing in the situation? To understand the situation,
understanding of the context is essential. Expressed knowledge of previous
relationships among entities (in this case, people and the fire) is also essential.
If the two people have a previously known relationship in a cooking session, that
relationship, together with some knowledge about cooking and its relatedness to
fire, can greatly enhance understanding of the context at hand.

It is not realistic to assume that such knowledge about entities, especially
about people, can be captured by intelligent sensor networks. Nowadays, the
activities of people are often shared via the Web. Web 2.0 services like Twitter,
Facebook, and Flickr are now recording various information about people in
unstructured or semi-structured data. For example, people post their experiences
about books or movies in blogs and microblogs, often with clear evidence like
database links or semantic web tags.

With the advent of the semantic web, the activities have been modeled as
an ontology which is in a machine-readable form [8]. Such an ontology provides
sophisticated information about how people are related to each other. Thus,
the common ground among people in the real world can be analyzed by using
the existing ontologies that describe Web users' activities.

In this paper, we focus on analyzing the relationships among entities
based on an ontology, when the entities (in this case, people) are recognized by
a cognition system. Web Ontology Language (OWL) is a Semantic Web language
designed to represent resources and publish them on the Web. OWL is a
graph-based representation of knowledge. In OWL, a unique entity is represented as
an instance (a node) and the relationships between entities are represented as
object properties (edges). Instances that have a direct relationship have a direct
link between them. Also, it is possible to find an indirect relationship through a third
instance acting as a mediator. In the former case it is easy to find a relationship between
instances, but in the latter case the relationship can be discovered only through the paths
composed of nodes (instances) and edges (properties).

Such a relationship can be interpreted by classes and properties of an ontology.
Thus, in this paper, a relationship between instances belonging to a class
is defined as a path of classes and properties. In order to discover more
meaningful relationships, the relationship is restricted to a path which has a mediating
class. For example, let an ontology have Person and Food as classes and cook
as a property which links the two classes. Two instances, p1 and p2, are both
instances of Person. They are both linked to an instance belonging to Food, with
the property cook. Then, it is possible to declare that they have a relationship,
and the relationship in this case is [p1, cook, Food, cook, p2]. The two instances
have an identical path from each of them to the mediator, Food.
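
The Person/Food example can be sketched with a toy set of triples. The instance names, the triple layout, and the function below are hypothetical and serve only to illustrate an instance-level relationship through a shared mediator; they are not the representation used by the proposed method.

```python
from itertools import combinations

# Hypothetical toy ontology: class membership and (subject, property, object) triples.
instance_class = {
    "p1": "Person", "p2": "Person", "p3": "Person",
    "pasta": "Food", "soup": "Food",
}
triples = [
    ("p1", "cook", "pasta"),
    ("p2", "cook", "pasta"),
    ("p3", "cook", "soup"),
]


def shared_mediators(triples, instance_class, source_class, prop, mediator_class):
    """Find pairs of source instances linked by `prop` to the same mediator instance."""
    targets = {}
    for subject, predicate, obj in triples:
        if (predicate == prop
                and instance_class.get(subject) == source_class
                and instance_class.get(obj) == mediator_class):
            targets.setdefault(subject, set()).add(obj)
    result = {}
    for x, y in combinations(sorted(targets), 2):
        common = targets[x] & targets[y]
        if common:
            result[(x, y)] = common
    return result


pairs = shared_mediators(triples, instance_class, "Person", "cook", "Food")
```

Here only p1 and p2 share a mediator instance, so they realize the relationship [p1, cook, Food, cook, p2] at the instance level, while p3 is related to them only at the level of the class Food.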

This paper proposes a method to find similar ontology instances, by ranking
instances of the same class in terms of a given relationship. For example, if a
person is given to the method with a target relationship, the proposed method
will find similar people (instances of the same class) in terms of the given
relationship. Thus, with the proposed method and a sufficient ontology, it is possible
to answer requests such as “Find people related to person A in terms of cooking” or “List
all persons that are similar to me in terms of reading philosophy books”.

There are several previous works on similar tasks [3,20,14]. Ranking or
similarity methods of previous work are generally based on connected paths between
entities. However, there could be some other meaningful relationships at the
level of the schema in an ontology, even though two instances are not connected through any
mediating instances. For example, it is possible that two people do not share
any instance directly (say, no same book), yet they share something in common
at the level of the schema (say, same genre of books, or same group of books).

To validate the approach, the relationship among people for a specific
category of books and a genre of movies has been tested to find and rank similar
people. The results show that the ranking produced by the proposed method is
positively correlated with a labeled ranking. The proposed method is
also compared with two baseline methods. The first compares only
the paths that connect instances and disregards class level paths. The
other considers only the paths at the schema level. The
proposed method outperforms both methods.

The rest of the paper is organized as follows. Section 2 reviews some related works,
and Section 3 presents how to discover entities with a relationship in an ontology.
Section 4 explains the ranking method used to rank discovered entities by a given
relationship. Section 5 shows the experimental results, and Section 6 concludes
the paper.

2  Related  Work 

Kemafor et al. formalized relationships between instances for the RDF data
model, which are called semantic associations [11]. They defined four types of
semantic associations for a given property sequence. For example, a simple
semantic association between two instances is defined as a connected path from one of
them to the other one through the property sequence. A property sequence
explicitly expresses the meaning of a relationship. However, the property sequence
can be interpreted in different ways depending on the classes to which the instances belong. Boanerges
et al. [6] showed that such classes are useful to find a specific relationship. For
example, when identifying money laundering, it is meaningful that a semantic
association has an instance belonging to a class Bank. That is, relationships by
a property sequence are discriminated by specifying the gaps between properties
with classes.

In this paper, a relationship is defined without ambiguity by using a sequence
of classes and properties. In addition, we focus on relationships among instances
belonging to the first class of a given class and property sequence. Thus what
is important in this paper is not only to discover relationships from instances
but also to measure similarities between relationships from one instance and
relationships from the other one.

Many previous studies try to discover meaningful relationships
between ontology instances [1,2,4]. Recently, SPARQL grammar has been utilized for
discovering semantic associations [12,13]. Not only are the discovered associations
new information by themselves, but the path between instances can also be used
to measure the degree of relationships. In this paper, standard SPARQL is used
to discover relationships as paths between instances.

A measurement for a relationship between classes was proposed in [16,17].
They used the probability information on instances related to a relationship
between two classes. The work is similar to ours in the aspect that a relationship
represented as classes and properties is measured. However, we are interested in
relationships between instances while [16,17] only deal with those between classes.

Boanerges et al. proposed six methods to measure the degree of relationships
between instances [5]. They considered a connected path between
instances to measure relationship degree. Thus, the main difference between our
work and others [16,17,5] is that not only connected paths between instances
but also connected paths through a schema are used to measure relationship
degree between instances. Even when there is no connected path between two
instances at the instance level, some of the paths from each of them are used to measure
the relationship through the schema. More details are discussed in the next
section.

Measuring relationships between instances is useful in various domains. Amit
et al. [3] used it to find terrorists for national security and to detect money laundering. They
focused on discovering paths between instances for a given property sequence.
Recently, relationships between people entities expressed in ontologies were used to
find social groups and social networks. Li et al. [6] proposed methods to integrate
FOAF data and to extract social networks. Anna et al. [20] modeled networks of
folksonomies and proposed a community dynamics notification algorithm to
discover social networks from the network model. There are studies [18,9] that
use ontologies as personal profiles to measure similarity between users' respective
preferences. The proposed method of this paper can enhance applications like
preference modeling and social group finding, since the proposed method can
reveal previously hidden indirect relationships.

3  Discovering  Entities  with  a  Common  Relationship  in 
an  Ontology 

A  property  sequence  represents  an  implicit  and  complex  relationship  between 
instances  of  an  ontology  [11].  The  meaning  of  such  a  relationship  becomes  clear 
by  specifying  classes  [6].  In  the  viewpoint  of  [6],  a  relationship  can  be  expressed  as 


\[ P = [C_0, r_1, C_1, r_2, C_2, \ldots, r_n, C_n], \]

where $C_i$, $0 \le i \le n$, $n \ge 1$, are classes, $r_j$, $1 \le j \le n$, are properties, and $C_i$
and $C_{i+1}$ are linked by the property $r_{i+1}$ in a given ontology.

That is, a relationship is a path in the schema of an ontology graph. In this
paper, the first class $C_0$ is called the source class of a relationship $P$ and the last
one, $C_n$, is called the target class of $P$. A relationship $P$ is then discovered by
finding a path between an instance of $C_0$ and an instance of $C_n$ at the instance
level.

There are many kinds of relationships, given as linked pairs of classes in an ontology.
Among all relationships, this paper focuses on relationships among instances of the
same class. In particular, it is meaningful when two instances have an identical kind
of relationship $P$. The target class of $P$ is a mediator between the two instances.
Intuitively, this can be interpreted as the two entities having common ground.
Thus, in this paper, if two instances of the same class have an identical kind of
relationship, we say that the two instances have a common relationship. A common
relationship $l^P_{xy}$ between two instances $x$ and $y$ for a given relationship $P$ can be
expressed as

\[ l^P_{xy} = [x, r_1, C_1, r_2, \ldots, r_n, C_n, r_n, \ldots, r_2, C_1, r_1, y], \]

where $P = [C_0, r_1, C_1, r_2, C_2, \ldots, r_n, C_n]$, $x$ and $y$ belong to $C_0$, and each of
$x$ and $y$ has the relationship $P$. $l^P_{xy}$ and $l^P_{yx}$ are identical by this expression. Each
of the instances $x$ and $y$ is called a source instance. An instance of a target class is called
a target instance. Note that the two source instances need not be connected with
a shared target instance in order to have a common relationship.

In order to discover a common relationship for a given relationship $P$, the source
instances of $P$ must be validated as to whether they have the relationship $P$. This is
done by using a formal query language for an ontology and an existing
reasoner. $P$ corresponds to a formal query to discover a relationship.

The final common relationships are all of the pairs of positively validated
instances. Thus, the number of instance pairs with a common relationship
is less than or equal to the number of 2-combinations of the source instances.
Though the number of pairs of instances is finite, it grows quadratically with
the number of source instances. However, all of the common relationships can be
discovered ahead of query time. In addition, our goal is to answer a query such as
“What is the most similar instance to an instance $x$ for a given relationship
$P$?” In this case, the number of candidate instances is equal to (# of source
instances of $P$) - 1.

The focus of our current evaluations is measuring common ground
among people based on an event ontology [19]. The ontology describes people's
experiences of books or movies. Figure 1 shows examples of common relationships
in the ontology.

The class Person is related to the class Event through the property hasEvent.
Event is defined based on fundamental factors which describe an event.
That is, the object of an event and the place and time related to the event are
represented as properties of Event. Event has two subclasses, Appreciate and
Read. Read is related to Book through object. Book has subclasses such
as History, Literature, and Religion, which are categories of books.

Fig. 1. Examples of common relationships in the event-ontology

When a relationship is given as $P_1$ = [Person, hasEvent, Read, object, Literature],
the instance $I_1$ of the class Person has two paths which satisfy the
relationship $P_1$. $I_2$ also has two such paths. Thus, both of them have a common
relationship for $P_1$.
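The validation step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy instances (`I1`, `e1`, `b1`, and so on), the dictionaries `CLASS_OF` and `EDGES`, and the helper names are hypothetical, and the sketch matches classes exactly instead of using a reasoner for subclass inference. It does not reproduce the SPARQL and Jena setup used in the experiment.

```python
from itertools import combinations

# Hypothetical toy ontology: instance -> class, and (subject, property) -> objects.
CLASS_OF = {
    "I1": "Person", "I2": "Person", "I3": "Person",
    "e1": "Read", "e2": "Read", "e3": "Appreciate",
    "b1": "Literature", "b2": "Literature", "m1": "Action",
}
EDGES = {
    ("I1", "hasEvent"): ["e1"],
    ("I2", "hasEvent"): ["e2"],
    ("I3", "hasEvent"): ["e3"],
    ("e1", "object"): ["b1"],
    ("e2", "object"): ["b2"],
    ("e3", "object"): ["m1"],
}


def paths(instance, relationship):
    """Return all instance-level paths from `instance` that match the
    relationship [C0, r1, C1, ..., rn, Cn] (exact class match, no reasoning)."""
    if CLASS_OF.get(instance) != relationship[0]:
        return []
    if len(relationship) == 1:
        return [[instance]]
    prop, rest = relationship[1], relationship[2:]
    found = []
    for nxt in EDGES.get((instance, prop), []):
        for tail in paths(nxt, rest):
            found.append([instance, prop] + tail)
    return found


def common_relationships(relationship):
    """Pairs of source instances that each have at least one matching path."""
    sources = sorted(i for i, c in CLASS_OF.items() if c == relationship[0])
    valid = [s for s in sources if paths(s, relationship)]
    return list(combinations(valid, 2))


P1 = ["Person", "hasEvent", "Read", "object", "Literature"]
PAIRS = common_relationships(P1)
```

In this toy data `I1` and `I2` read different books, yet they still form a pair, because a shared target instance is not required for a common relationship.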

They are connected with the same two target instances. However, a linked path
at the instance level is not a necessary condition for a common relationship. Let a
relationship $P_2$ be [Person, hasEvent, Read, object, History]. There is no target
instance that connects the instances $I_1$ and $I_2$ for $P_2$. However, two history
books are target instances, $b_1$ from $I_1$ and another from $I_2$.
Thus, they have a common relationship for $P_2$. Interestingly, both books
are written by $p_3$ of Person. This means that the two people read books by the same
author, though the books are different.

On the other hand, some target instances of $I_1$ share no common property
with the target instances of $I_2$. However, they are connected with all of the
target instances of $I_2$ at the schema level. That is, the target instances of $I_1$
are connected with those of $I_2$ through the mediator $C_n$. $I_1$ and $I_2$ also have a
common relationship for $P_2$ in such a case. Though two people do not share any
book, if each of them has read many books of the same category, they may
share something to talk about with each other.

4  Ranking  Common  Relationships  among  Instances 

We propose the following target function to rank common relationships for a
relationship $P$:

\[ f(l^P_{ab}, l^P_{cd}) = \begin{cases} 1 & \text{if } rel(l^P_{ab}) - rel(l^P_{cd}) > 0, \\ -1 & \text{if } rel(l^P_{ab}) - rel(l^P_{cd}) < 0, \end{cases} \quad \text{where } l^P_{xy} \in R_P. \]

$R_P$ is the set of possible common relationships for $P$. The function $rel(l^P_{xy})$ is a
measurement for the common relationship $l^P_{xy}$. The function $f$ is antisymmetric in its
two variables $l^P_{ab}$ and $l^P_{cd}$. That is, $f(l^P_{ab}, l^P_{cd})$ is equal to $-f(l^P_{cd}, l^P_{ab})$.

The  basic  idea  to  measure  the  degree  of  relationships  between  two  instances 
is  that  the  more  connections  they  have,  the  more  similar  they  are.  What  is 
important  is  that  the  connections  through  a  schema  level  should  be  considered 
for  the  measurement. 

To do this, discovered relationships for $P$ are grouped according to their source
instances. Let $G^P_x$ be the set of discovered relationships from a source instance $x$ of
a relationship $P$. Then, $G^P_y$ is the set of relationships from $y$. The function $rel(l^P_{xy})$
is realized by measuring similarity between these two groups.

Features  of  a  group  are  represented  as  a  vector  to  measure  the  similarity. 
They  are  corresponding  to  property  values  of  target  instances.  A  simple  way  to 
decide  a  feature  value  is  to  count  the  number  of  occurrences  of  property  values 
from  target  instances.  By  doing  this,  connectivity  by  a  target  instance  and  their 
property  values  can  be  quantified  by  a  similarity  measure  between  two  vectors. 

An  additional  feature  should  be  considered  in  order  to  measure  connectivity 
between  instances  in  the  schema  level.  The  following  is  a  measurement  for  the 
connectivity  from  a  source  instance  x  to  the  target  class  Cn  of  a  relationship  P. 

\[ connectivity(x, C_n, P) = \frac{\#\text{ of paths in } G^P_x}{\#\text{ of paths from } x \text{ to instances of } C_1 \text{ through } r_1} \]

Each path of $G^P_x$ starts with a hop from $x$ to instances of $C_1$ through $r_1$. However,
not all paths starting with such hops belong to $G^P_x$. Thus the function
connectivity quantifies how much an instance $x$ allots to the target class $C_n$.
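As a small worked illustration of this ratio, the sketch below computes the schema-level connectivity from two path counts. The function signature, the zero-denominator convention, and the numbers are assumptions made only for the example.

```python
def connectivity(paths_in_group, paths_through_r1):
    """Ratio of x's paths that reach the target class Cn to all of x's
    paths whose first hop goes through r1."""
    if paths_through_r1 == 0:
        return 0.0  # assumption: an instance with no first hops has no connectivity
    return paths_in_group / paths_through_r1


# Hypothetical blogger: 40 events in total, 8 of which lead to Philosophy books.
ratio = connectivity(8, 40)
```

Under these assumed counts the blogger allots one fifth of all events to the target class.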

Therefore, without loss of generality, a feature vector of $G^P_x$ can be expressed in a
weighted form. Let $n$ be the number of distinct property values of target instances.
Then a weighted vector for $G^P_x$ is expressed as

\[ v^P_x = [w_1 v^x_1, w_2 v^x_2, \ldots, w_n v^x_n, w_{n+1} v^x_{n+1}], \]

where $v^x_i$, $1 \le i \le n$, are frequencies of property values from target instances,
$v^x_{n+1} = connectivity(x, C_n, P)$, $w_j$, $1 \le j \le n$, are weights for connectivity at the
instance level, $w_{n+1}$ is a weight for connectivity at the schema level, and $\sum_{j=1}^{n+1} w_j = 1$.

We are interested in the meaningfulness of the connectivity between instances at
the schema level. For simplicity we assume that the weights at the instance level are
identical. Let $v'^P_x$ be the vector $v^P_x$ without its last feature. Then the function
$rel(l^P_{xy})$ can be defined by


\[ rel(l^P_{xy}) = \alpha^2 \, v'^P_x \cdot v'^P_y + \beta^2 \, v^x_{n+1} v^y_{n+1}, \quad \text{where } \alpha = n w_1 \text{ and } \beta = w_{n+1}. \]


The instance-level connectivity is measured by the inner product between $v'^P_x$
and $v'^P_y$. Connectivity at the schema level is measured by the product between $v^x_{n+1}$
and $v^y_{n+1}$. $\alpha$ is the total weight for connectivities at the instance level. This function
is equivalent to an inner product between $v^P_x$ and $v^P_y$. The function $rel$ is applied
to the target ranking function $f$. Then the weights $\alpha$ and $\beta$ can be adapted by
using rank-labeled common relationships.

Note that the ranking function provides general ranks for common relationships
among instances. The ranks of instances similar to a given entity $e$ are decided by
sorting the common relationships with $e$.
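The ranking step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it assumes $rel$ has the form $\alpha^2 (v'_x \cdot v'_y) + \beta^2 v^x_{n+1} v^y_{n+1}$, treats ties as 0 (the target function does not define them), and uses hypothetical profiles made of a frequency vector and a connectivity value.

```python
def rel(freq_x, freq_y, conn_x, conn_y, alpha, beta):
    """Assumed form: alpha^2 * <v'_x, v'_y> + beta^2 * conn_x * conn_y."""
    instance_part = sum(a * b for a, b in zip(freq_x, freq_y))
    schema_part = conn_x * conn_y
    return alpha ** 2 * instance_part + beta ** 2 * schema_part


def f(rel_ab, rel_cd):
    """Pairwise target function: 1 if the first relationship ranks higher,
    -1 if it ranks lower."""
    if rel_ab > rel_cd:
        return 1
    if rel_ab < rel_cd:
        return -1
    return 0  # assumption: ties are not covered by the definition


def rank_similar(query, profiles, alpha, beta):
    """Sort all other instances by their rel score with `query`, best first."""
    q_freq, q_conn = profiles[query]
    scores = {
        name: rel(q_freq, freq, q_conn, conn, alpha, beta)
        for name, (freq, conn) in profiles.items()
        if name != query
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical profiles: (frequencies of property values, schema-level connectivity).
PROFILES = {
    "p1": ([2, 0, 1], 0.25),
    "p2": ([1, 0, 1], 0.5),
    "p3": ([0, 3, 0], 0.75),
}
INSTANCE_ONLY = rank_similar("p1", PROFILES, alpha=1.0, beta=0.0)
SCHEMA_ONLY = rank_similar("p1", PROFILES, alpha=0.0, beta=1.0)
```

With these toy profiles the two settings disagree: the instance-level score prefers `p2`, which shares property values with `p1`, while the schema-level score prefers `p3`, which has the higher connectivity.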

5  Experiment 

To evaluate the proposed method, we use an event-ontology that describes people's
blog postings on books, movies and IT products [19]. We randomly selected
11 bloggers who have more than thirty postings for each of the domains of books and
movies. Six human annotators labeled the ranks of bloggers according to
common ground with a blogger. Each annotator took up a blogger's position
and ranked the other bloggers for a given query.

Table 1 is an example of bloggers' ranks for the query “Rank the bloggers that
are similar to your blogger in terms of reading of philosophy”.


Table  1.  Ranks  of  bloggers  similar  to  a  blogger  in  terms  of  reading  of  philosophy 


      p1    p2    p3    p4    p5    p6
p1    -     4     5     5     3     4
p2

Annotators ranked 10 bloggers from the positions of the bloggers in the first row. The first
column represents the ranked bloggers. For example, the column of $p_1$ gives the order in
which the other bloggers are sorted from the position of $p_1$. The ranks of the first five rows are used
to adapt the weight parameters of the proposed ranking method. The parameters were
determined empirically by using correlations between labeled ranks and ranks by
the proposed method. The method is tested with the data of the last five rows.
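The comparison between labeled ranks and computed ranks can be illustrated with a rank correlation. The text does not say which correlation coefficient was used, so the sketch below assumes Spearman's rho without ties; the function name and the two example rankings are hypothetical.

```python
def spearman(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


labeled = [1, 2, 3, 4, 5]      # ranks assigned by an annotator
predicted = [2, 1, 3, 4, 5]    # ranks produced by a ranking method
rho = spearman(labeled, predicted)
reversed_rho = spearman(labeled, [5, 4, 3, 2, 1])
```

A value near 1 means the method orders the bloggers almost as the annotator did, and a value near -1 means the order is reversed.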

The  following  4  queries  are  used  for  the  experiment. 

- Q1. Rank the bloggers that are similar to your blogger in terms of reading
of philosophy.

- Q2. Rank the bloggers that are similar to your blogger in terms of reading
of economy.

- Q3. Rank the bloggers that are similar to your blogger in terms of appreciation
of action.

- Q4. Rank the bloggers that are similar to your blogger in terms of appreciation
of animation.

Q1 and Q2 are about categories of books. Q3 and Q4 are about
genres of movies. These queries correspond to the relationships in Table 2.

Table  2.  Relationships  corresponding  to  the  test  queries 


Query  Relationship

Q1     [Person, hasEvent, Read, object, Philosophy]
Q2     [Person, hasEvent, Read, object, Economy]
Q3     [Person, hasEvent, Appreciate, object, Action]
Q4     [Person, hasEvent, Appreciate, object, Animation]


The proposed method discovers common relationships from the relationships
in Table 2. It used SPARQL [15] as a formal query language for an ontology, and
the Jena API [7] as a reasoner.

Two baselines were tested for comparative analysis with the proposed method.
The first one, $f_i$, is a method that only compares paths connected by target instances.
It is simply modeled by setting the weight $\beta$ of the proposed method
to 0. The other one, $f_s$, is a method that considers only paths at the schema level.
It is given by setting the weight $\alpha$ to 0. The proposed ranking method
is denoted as $f_{i+s}$.

Figure 2 shows the experimental results of the three measurements for the four queries.
The x axis represents bloggers. For each of the bloggers, the correlations between
labeled ranks and ranks by the three measurements are recorded on the y axis. Weights
for the proposed method are marked under each graph. A larger correlation
means greater similarity to the test data.

Fig. 2. The correlation between test data and the results by three measurements

Both Q1 and Q2 ask about the common ground for a book category. However, the
two graphs show very different results. $f_i$ is relatively superior in Q1, but not
in Q2. In addition, no target value of $f_i$ is given for $p_3$ and $p_4$ in Q2. This
means that the books read by $p_3$ and $p_4$ do not share any properties with those of the others.
Indeed, their relationship group vectors are very sparse, since they read just
two books on economy. Such sparseness is serious in Q2: $p_2$ and $p_6$ read the largest
number of books on economy, and that number is just six. On the other hand, the
six bloggers read at least eight books on philosophy. Therefore, it is reasonable to
assume that $f_i$ works well if sufficient experiences are observed. Remarkably,
$f_s$ worked meaningfully in Q2. The proposed method $f_{i+s}$ reflects
whichever of $f_i$ and $f_s$ is more meaningful. That is,
$f_{i+s}$ is nearly identical to $f_i$ in Q1 and to $f_s$ in Q2.

Both Q3 and Q4 are related to genres of movies. Most of the bloggers appreciated
at least ten movies. Unusually, $p_1$ in Q3 appreciated just one action
movie. Thus a target value of $f_i$ cannot be determined for $p_1$. The largest numbers
of movies are 64 and 26 for action and animation, respectively. Thus, both cases are
relatively free from the sparseness problem compared with Q1 and Q2. $f_i$ gives
a correlation of more than 0.7 in most of the cases for Q3 and Q4. However, $f_s$ gives
more consistent results than $f_i$. This means that the participants are interested in
unseen movies as well as already seen movies. The proposed method $f_{i+s}$
improves on the result of $f_i$ by reflecting $f_s$.

Most of the results of the proposed method $f_{i+s}$ showed a correlation of more than 0.6.
It outperformed the two baseline methods.

Then, how does the connectivity at the schema level contribute to ranking similar
instances? In order to answer this question, the schema-level connectivities of the six
bloggers are presented in Table 3.


Table 3. Schema-level connectivities from six blogger instances to target classes


             p1    p2    p3    p4    p5    p6

Philosophy   0.27  0.26  0.16  0.35  0.43  0.33
Economy      0.10  0.14  0.01  0.04  0.11  0.14
Action       0.04  0.23  0.33  0.23  0.20  0.29
Animation    0.22  0.05  0.14  0.16  0.26  0.21


In this experiment, the connectivity from an instance (a blogger) to a target class
(a category or a genre) is equal to the ratio of the number of items a blogger has seen in
the specific category to the total number of items the blogger has seen in the domain. Intuitively,
it can be considered as the degree of interest in a book category or a movie genre.
Most of the ratios for economy are lower than the others. We observed that $f_i$ is very
meaningful for Q1 (philosophy), but not for Q2 (economy). In particular, pz in
philosophy and p.j in animation show a definite tendency: the less interest a
blogger has, the more dependent on others' experience the blogger becomes. In other
words, if people are interested in a specific domain, then common experiences
of the same books or movies are important for constructing common ground for debate.
However, po in action and pz in animation show results against this tendency.
This means that inexperienced objects could be meaningful for constructing common
ground irrespective of interest.

6  Conclusion 

In this paper, we proposed a similarity ranking method for entities with a common
relationship. A common relationship between instances is formalized as a
bi-directional link of classes and properties. If two instances have an identical
path pattern from each of them to an instance, they have a common relationship
by the formalism. Thus hidden indirect relationships between instances can be
discovered as a connected path through the schema of an ontology. The proposed
ranking method uses not only connected paths at the instance level, but also paths
through the schema of an ontology. The experimental results show that our method is
more correlated with people's intuition than a method that considers only connected
paths between instances.

The proposed method can rank common relationships among entities.
That is, when A, B, C, and D are different entities, the method can decide
the relative degree of relationship between any two pairs of them. This is
applicable to cases such as finding the best partner among candidates for a project.
In future work, such a task will be studied by using the proposed method.
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Abstract.  Motion  trajectories  provide  rich  spatio-temporal  information 
about an object's activity. In this paper, we present a novel anomaly detection
framework to detect anomalous motion trajectories using the fusion
of adaptive piecewise analysis and a fuzzy rule-based method. First,
we segment the moving objects
using a Gaussian mixture background model. Secondly, visual tracking
using probabilistic appearance manifolds is applied to extract the spatio-temporal
trajectory. Thirdly, adaptive piecewise analysis and data quantization are
performed on the extracted trajectory such that the anomaly detection
can be performed as the incoming data are acquired. Finally, the anomalous
trajectory is detected through the accumulative rank of the adaptive piecewise
analysis and a fuzzy rule-based anomaly detection framework.
Experimental results on various challenging trajectory data have validated
the effectiveness of the proposed method.


1  Introduction 

Detecting anomalous patterns from video sequences is useful for many applications
such as surveillance, novelty extraction, automatic inspection, etc. The
identification of anomalies can lead to the discovery of truly novel information
from the video [12,5,2,13]. For instance, anomalous behaviour might be a person
walking in a region not used by most people, a car following a zigzag path, or a
person running in a region where most people simply walk. A path is any established
line of travel or access, and a trajectory can be defined as a path followed
by an object moving through space.

In this paper, we present a framework for detecting nonconforming trajectories
of objects as they pass through a scene by the fusion of adaptive piecewise analysis
and a fuzzy rule-based method. We concern ourselves primarily with human
movements in a car park scene, but the method is general and can be extended
to any similar scenario. First of all, the moving objects in the image sequences
are segmented using a Gaussian mixture background model, followed by visual
tracking using probabilistic appearance manifolds [9] to extract the spatio-temporal
trajectory. Secondly, adaptive piecewise analysis and data quantization are performed
on the extracted trajectory. That is, the adaptive piecewise analysis is
performed after a sufficient amount of tracking data has been accumulated. The
appropriate duration $\tau$ depends on the amount of traffic in the scene and
the required accuracy of the model. Finally, the anomalous trajectory is detected
through the accumulative rank of the adaptive piecewise analysis and a fuzzy
rule-based anomaly detection framework.


[Figure: block diagram. Input Video; Background Modelling & Moving Object Detection (Foreground Objects); Object Tracking (Tracks); Scene Analysis (Abnormal Events)]

Fig. 1. The Proposed Anomalous Motion Trajectory Detection Framework


The advantages of the proposed method are two-fold: on one hand, we would like
to keep the problem computationally tractable, so that exhaustive training data
and a learning process can be avoided; on the other hand, it provides a means
to detect suspicious tracks through the accumulative rank of adaptive piecewise
analysis and the fuzzy rule-based method, such that the anomaly detection can be
performed as the incoming data are acquired, in contrast to off-line approaches
like many of the aforementioned works. Our main aim is to avoid the classical
two-step approaches (data collection and off-line processing).

The rest of the paper is structured as follows. Section 2 discusses the related
work. Section 3 presents the proposed anomalous trajectory detection using the
fusion of fuzzy rules and adaptive piecewise analysis. Section 4 shows the experimental
results. Section 5 concludes the paper with discussions and future
work.


2  Related  Work 

Trajectory analysis is an important step in applications like video surveillance,
automotive systems, medical screening and autonomous robotic systems. Previous
research on abnormal activity detection can be roughly divided into two
categories: parametric approaches and non-parametric approaches. Grimson et
al. [6] use a distributed system of cameras to cover a scene, and employ an adaptive
tracker to detect moving objects. Tracks are clustered using spatial features
based on the vector quantisation approach. Once these clusters are obtained, the unusual
activities are detected by matching incoming trajectories to these clusters.
Hu et al. [8] present a recently published technique in which the tracks are
spatially and temporally clustered into different motion patterns. Each of these
motion patterns is divided into several segments; each segment is modelled by a
Gaussian model of speed and size. Makris and Ellis [4] develop a spatial model
to represent the routes in an image. A trajectory is matched with routes already
existing in a database using a simple distance measure. If a match is found, the
existing route is updated by a weight update function; otherwise a new route
is created for the new trajectory. One limitation of this approach is that only
spatial information is used for trajectory clustering and behaviour recognition.

Another popular technique for activity recognition is Bayesian networks
[3,1,11,7,14]. In [7], supervised training using a Bayesian formulation is used for
estimating the parameters of a multi-layered finite state machine model that is
proposed for activity recognition. Very recently, a Bayesian framework has been
used for action recognition using ballistic dynamics [15]. This method is based
on psycho-kinesiological observations, that is, on the ballistic nature of human
movements. Although all these approaches have demonstrated
success in modelling and recognizing activities, they all need
a large number of training sequences and intensive training in order for
each activity to be recognised correctly, which is not feasible for a real-time
application.


3  Our  Approach 

Given a collection of unlabeled videos, we focus on the problem of interpreting
the output of the object detection and tracking module in order to detect
suspicious motion patterns. The proposed approach is illustrated in Figure 1.

3.1  Object  Detection  and  Tracking

The visual tracking information serves as the input for our framework, and we
have employed the object detection and tracking system presented in [9]. The
whole system includes the following components: a Gaussian mixture background
model, motion detection from background subtraction, and the appearance manifold
based tracking algorithm to extract the trace of each object. The tracker
outputs a set of $m$ tracks $\{T_1, \ldots, T_i, \ldots, T_m\}$, where every track
is a set of observations of the same object. For instance, any $i$-th track is a set
of observations $T_i = \{O_1, \ldots, O_j, \ldots, O_n\}$, where each $O_j$ contains the
displacement of an object in the image plane $(x, y)$.

3.2  Adaptive  Piecewise  and  Data  Quantization 

Piecewise linear analysis is ubiquitous. Let us model any nonlinear unicomparametric
function $g(f)$ with a constrained piecewise linear function $g_{PL}(f)$. A
piecewise function is defined with $N$ linear segments over the interval $[x_0, x_N]$
as

\[ g_{PL}(f) = \begin{cases} \gamma_1(f) & x_0 \le f \le x_1 \\ \gamma_2(f) & x_1 \le f \le x_2 \\ \vdots & \\ \gamma_N(f) & x_{N-1} \le f \le x_N \end{cases} \tag{1} \]

where $\gamma_n(f) = a_n f + b_n$, $n = 1, \ldots, N$, is a linear segment and $x_j$ represents
the $N + 1$ prespecified knots in $[x_0, x_N]$.

Given $M$ pairs of samples $(f_m, g_m)$, $m = 0, \ldots, M - 1$, from the image sequences,
the best-fit piecewise linear function is the one that minimizes the
cost function

\[ J = \sum_{m=0}^{M-1} \left( g_m - g_{PL}(f_m) \right)^2 \tag{2} \]
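To minimize this cost function with respect to the $N - 1$ unknown parameters
$a_1, \ldots, a_{N-1}$, we can evaluate $\partial J / \partial a_j = 0$, where $j = 1, \ldots, N - 1$, to get a system
of $N - 1$ simultaneous equations in $N - 1$ unknowns:

\[ \sum_{m=0}^{M-1} g_{PL}(f_m) \frac{\partial g_{PL}(f_m)}{\partial a_j} = \sum_{m=0}^{M-1} g_m \frac{\partial g_{PL}(f_m)}{\partial a_j} \tag{3} \]

where note that

\[ \frac{\partial \gamma_n(f_m)}{\partial a_j} = \begin{cases} 0 & n < j \\ f_m - x_{j-1} & n = j \\ x_j - x_{j-1} & j < n < N \\ (1 - h(f_m))(x_j - x_{j-1}) & n = N \end{cases} \tag{4} \]

and $n = 1, \ldots, N$.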

However, identifying constant intervals a posteriori from a piecewise linear
model risks misidentifying them.
In this paper, we adopted the adaptive piecewise analysis [10] (Algorithm 1),
where we first apply the aforementioned piecewise linear analysis and
then seek to split the linear intervals into constant intervals (please refer to
Fig. 2). Because the algorithm only splits an interval if the fit error can be
reduced, it is guaranteed not to degrade the fit error.

In this paper, we define a segmentation as a sorted set of segmentation indexes
$z_0, \ldots, z_k$ such that $z_0 = 0$ and $z_k = n$. The segmentation points divide the time
series into intervals $S_1, \ldots, S_k$ defined by the segmentation indexes as
$S_j = \{(x_t, y_t) \mid z_{j-1} \le t < z_j\}$. The segmentation error is computed from

\[ \sum_{j=1}^{k} Q(S_j) \tag{5} \]

where the function $Q$ is the square of the Euclidean ($l_2$) regression error. Formally,

\[ Q(S_j) = \min_p \sum_{r=z_{j-1}}^{z_j - 1} \left( p(x_r) - y_r \right)^2 \tag{6} \]

Input: Time series (x_i, y_i) of length n
Input: Bound on polynomial degree N and model complexity k
Input: Function E(p, q, d) computing fit error with poly. in range [x_p, x_q)

S ← empty list
d ← N - 1
S ← (0, n, d, E(0, n, d))
b ← k - d
while b - d ≥ 0 do
    find tuple (i, j, d, e) in S with maximum last entry
    find minimum of E(i, l, d) + E(l, j, d) for l = i + 1, ..., j
    remove tuple (i, j, d, e) from S
    insert tuples (i, l, d, E(i, l, d)) and (l, j, d, E(l, j, d)) in S
    b ← b - d
end
for tuple (i, j, q, e) in S do
    find minimum m of E(i, l, d') + E(l, j, q - d' - 1) for l = i + 1, ..., j and
    0 ≤ d' ≤ q - 1
    if m < e then
        remove tuple (i, j, q, e) from S
        insert tuples (i, l, d', E(i, l, d')) and (l, j, q - d' - 1, E(l, j, q - d' - 1)) in S
    end
end

Algorithm 1. Adaptive Piecewise Algorithm


where the minimum is over the polynomials $p$ of a given degree. For instance, if
the interval $S_j$ is said to be constant, then

\[ Q(S_j) = \sum_{r=z_{j-1}}^{z_j - 1} \left( y_r - \bar{y} \right)^2 \tag{7} \]

where $\bar{y}$ is the average, $\bar{y} = \frac{1}{z_j - z_{j-1}} \sum_{r=z_{j-1}}^{z_j - 1} y_r$. Similarly, if the interval has a
linear model, then $p(x)$ is chosen to be the linear polynomial $p(x) = ax + b$, where
$a$ and $b$ are found by regression. The segmentation error can be generalised to
other norms, such as the maximum-error ($l_\infty$) norm [10,32], by replacing the $\sum$
operators by max operators.
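To make the segmentation error concrete, the following sketch evaluates $Q(S_j)$ for constant and linear models and sums it over half-open intervals given by the segmentation indexes. The function names and the toy series are assumptions for illustration; the sketch computes the error of a given segmentation and does not implement Algorithm 1.

```python
def q_constant(ys):
    """Squared l2 error of the best constant model (the mean)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def q_linear(xs, ys):
    """Squared l2 error of the least-squares line p(x) = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return q_constant(ys)  # a single x value: fall back to the constant model
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


def segmentation_error(xs, ys, indexes, model):
    """Sum of Q(S_j) over the intervals [z_{j-1}, z_j) given by `indexes`."""
    total = 0.0
    for lo, hi in zip(indexes, indexes[1:]):
        if model == "constant":
            total += q_constant(ys[lo:hi])
        else:
            total += q_linear(xs[lo:hi], ys[lo:hi])
    return total


# Toy series: a ramp followed by a plateau.
XS = [0, 1, 2, 3, 4, 5]
YS = [0, 1, 2, 2, 2, 2]
ONE_SEGMENT = segmentation_error(XS, YS, [0, 6], "linear")
TWO_SEGMENTS = segmentation_error(XS, YS, [0, 3, 6], "linear")
```

On this toy series, splitting at index 3 separates the ramp from the plateau, so the two-segment error drops to zero while the single line leaves a residual.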


3.3  Data  Quantization 


In  this  paper,  the  adaptive  piecewise  analysis  is  performed  after  a  sufficient 
amount  of  tracking  data  has  been  accumulated.  The  appropriate  duration,  r 
depends  on  the  amount  of  the  traffic  in  the  scene  and  the  required  accuracy  of 
the  model.  For  instance,  the  adaptive  piecewise  analysis  mT  is  obtained  from  the 
observation  vector  Oj-^j + T  and  empirically,  we  have  chosen  r  =  7.  Following  this, 
a  data-qiiaiitization  process  to  represent  the  outcome  qualitatively  is  conducted 
as  to  Eq.  8.  An  example  of  the  process  is  illustrated  in  Fig.  3. 


\[ \begin{cases} 0 & \text{if } m < 0 \\ 1 & \text{otherwise} \end{cases} \qquad (8) \]

Fig. 2. Example of the Adaptive Piecewise Analysis

(a) Local Adaptive Piecewise Analysis (b) Data Quantization

Fig. 3. Adaptive Piecewise Analysis and Data Quantization
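The quantization step can be illustrated with a short sketch. It assumes the piecewise analysis has already produced one gradient m per segment, and it encodes each gradient as a single bit following Eq. 8; the function name `quantize` and the string output are illustrative choices, not part of the paper.

```python
def quantize(slopes):
    """Map each segment gradient m to '0' if m < 0 and to '1' otherwise (Eq. 8)."""
    return "".join("0" if m < 0 else "1" for m in slopes)


# Example: three segments with falling, flat and rising gradients.
pattern = quantize([-0.4, 0.0, 1.2])
```

Two observation windows can then be compared simply by comparing their bit patterns.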


3.4  Fuzzy  Rule-Based  Anomaly  Detection 

In order to detect anomalous motion trajectories, we propose a fuzzy rule-based
system, shown in Fig. 4, which can automatically detect suspicious trajectories
moving along atypical paths. Each feature (e.g. time and continuity) is passed through
a set of fuzzy membership functions to obtain membership values corresponding to
LOW, MEDIUM or HIGH, which are then passed to the proposed fuzzy rules. In our
proposed approach, we do not need the whole trajectory to perform the anomaly
detection. As mentioned in the previous section, anomaly detection is performed
after a sufficient amount of tracking data has been accumulated.

The fuzzy inference engine consists of 6 rules, each of which generates
a response corresponding to 'Very Usual', 'Usual', 'Usual or Suspicious',
'Suspicious' or 'Very Suspicious'. The output membership function
corresponding to each of these responses is shown in Fig. 5.

1. If Time is HIGH and Continuity is LOW then USUAL OR SUSPICIOUS
2. If Time is HIGH and Continuity is MEDIUM then USUAL
3. If Time is HIGH and Continuity is HIGH then VERY USUAL
4. If Time is LOW and Continuity is LOW then USUAL OR SUSPICIOUS
5. If Time is LOW and Continuity is MEDIUM then SUSPICIOUS
6. If Time is LOW and Continuity is HIGH then VERY SUSPICIOUS

where

- Time = the difference between Time_i and Time_{i+τ}
- Continuity = the similarity of the piecewise linear analysis between Time_i
  and Time_{i+τ}


Fig. 4. The Fuzzy Rules
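The rule base above can be sketched as a lookup table combined with a min-based AND. This is only an illustration under stated assumptions: the paper gives its membership functions in a figure that is not reproduced here, so the triangular shapes, the normalisation of both features to [0, 1], and the names `triangle`, `memberships` and `fire_rules` are all hypothetical. Defuzzification of the output is not shown.

```python
# The six rules of Fig. 4: (Time label, Continuity label) -> response.
RULES = {
    ("HIGH", "LOW"): "USUAL OR SUSPICIOUS",
    ("HIGH", "MEDIUM"): "USUAL",
    ("HIGH", "HIGH"): "VERY USUAL",
    ("LOW", "LOW"): "USUAL OR SUSPICIOUS",
    ("LOW", "MEDIUM"): "SUSPICIOUS",
    ("LOW", "HIGH"): "VERY SUSPICIOUS",
}


def triangle(x, left, peak, right):
    """Triangular membership function (assumed shape, not from the paper)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def memberships(x):
    """Membership values of a feature normalised to [0, 1] (assumption)."""
    return {
        "LOW": triangle(x, -0.5, 0.0, 0.5),
        "MEDIUM": triangle(x, 0.0, 0.5, 1.0),
        "HIGH": triangle(x, 0.5, 1.0, 1.5),
    }


def fire_rules(time, continuity):
    """Return the firing strength of each response, using min as fuzzy AND."""
    m_time, m_cont = memberships(time), memberships(continuity)
    strengths = {}
    for (t_label, c_label), response in RULES.items():
        strength = min(m_time[t_label], m_cont[c_label])
        # Rules 1 and 4 share a response, so keep the stronger of the two.
        strengths[response] = max(strengths.get(response, 0.0), strength)
    return strengths
```

For example, a low Time value together with a high Continuity value fires rule 6 most strongly.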


4  Experiments 

In this section, we evaluate the effectiveness of the proposed approach on
100 datasets which consist of both benign (normal) and potentially dangerous
(suspicious) categories. The validation scenario is an outdoor environment, such
as a parking lot.

4.1  Experimental  Setup 

First, the trajectory of a moving object is extracted by subtracting the
background image, modelled by a Gaussian mixture, from the image of the
tracked object. The trajectory obtained can be given as follows: T_t =
{O_1, ..., O_j, ..., O_n}, where O_j = (x_j, y_j). Using the x-y coordinate points,
we perform the piecewise linear analysis over different durations. Next, we use the
gradient information, m, from the piecewise linear analysis to produce qualitative
data. Our condition for the quantization process is that if m > 0 (positive) then
it will be represented as '1', and else it will be represented as '0'. Finally, with the
qualitative data, an anomalous trajectory is detected by comparing against the
proposed rule-based framework.

Input: Trajectory {(x_i, y_i)}, i = 1, ..., t
for Time, T = 1 → T_end do
    CHECK T_i == T_{i+1} ??
    if YES then
        CHECK CONTINUOUS triggered ??
        while YES do
            Suspicious behaviour indication (SBI) = SBI_current − 20%
        else
            SBI = 50%
            SET CONTINUOUS = 1
        end
    end
    if NO then
        CHECK NONCONTINUOUS triggered ??
        while YES do
            SBI = SBI_current + 20%
            SET CONTINUOUS = 0
        else
            SBI = 50%
            SET CONTINUOUS = 0
            SET NONCONTINUOUS = 1
        end
    end
end
return SBI for each {(x_i, y_i)}, i = 1, ..., t

Algorithm 2. Anomaly Detection Algorithm
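One possible reading of Algorithm 2 is sketched below. It assumes that each time window has already been quantized into a comparable pattern, that a matching pair of consecutive windows lowers the suspicious behaviour indication (SBI) by 20 points, and that a mismatching pair raises it by 20 points; the sign of the second update is an assumption, since it is not legible in the printed algorithm. The clamp to the range 0 to 100 and the name `sbi_trace` are also assumptions.

```python
def sbi_trace(windows, step=20.0, reset=50.0):
    """Return the SBI (in percent) after each pair of consecutive windows."""
    continuous = False       # CONTINUOUS flag of Algorithm 2
    noncontinuous = False    # NONCONTINUOUS flag of Algorithm 2
    sbi = reset
    trace = []
    for previous, current in zip(windows, windows[1:]):
        if previous == current:
            if continuous:
                sbi -= step          # repeated match: less suspicious
            else:
                sbi = reset
                continuous = True
        else:
            if noncontinuous:
                sbi += step          # repeated mismatch (assumed sign)
            else:
                sbi = reset
                noncontinuous = True
            continuous = False
        sbi = min(100.0, max(0.0, sbi))  # clamp to a percentage (assumption)
        trace.append(sbi)
    return trace
```

As in the printed algorithm, the NONCONTINUOUS flag is never cleared once set, so later mismatches keep raising the indication.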

4.2  Results  and  Discussions 

Experiments are conducted to test the effectiveness of the approach in detecting
anomalous path trajectories using the proposed fuzzy rule-based framework. The
overall performance of the method is tested against 50 normal trajectories and
50 suspicious trajectories. The results are shown in Table 1.

Based on these results, the proposed approach achieves an accuracy of
around 90%, depending on the frame rate used. The frame rate is chosen
empirically. This result is considered good compared to other approaches,
which require an extensive training data set and an offline learning process that is
computationally expensive. Furthermore, we closely examined the misclassified
trajectories for each window size and noticed that most of the misclassified
trajectories were found in the same tracked object, T7 (please refer to Fig. 6).
One of the main reasons is the distorted trajectory points (noise) introduced during
the tracking process. We believe that this problem can be alleviated by using a more
accurate tracker, and this is work in progress.

Fig. 6. Sample of the Abnormal Trajectories Dataset

Table 1. Classification Results with Different Frame Rates

                                                         Accuracy for frame rate (fps)
# of subjects     Trajectory (correctly classified)      6       12      25
Single person     Normal                                 90%     93%     95%
Single person     Suspicious                             89%     84%     88%
Multiple persons  Normal                                 93%     94%     93%
Multiple persons  Suspicious                             90%     83%     88%
Total accuracy                                           92%     90%     91%

Correlation Analysis. In the second approach, we consider only correlation
analysis to perform anomaly detection, as a comparison. Correlation
measures the closeness of the linear relationship between X and Y after the
regression process. In this paper, we employ Pearson's product-moment
correlation coefficient to measure this linear relationship (Eq. 9).

\[ R = \frac{n \sum x_n y_n - \sum x_n \sum y_n}{\sqrt{n \sum x_n^2 - \left(\sum x_n\right)^2}\,\sqrt{n \sum y_n^2 - \left(\sum y_n\right)^2}} \qquad (9) \]

The value of the correlation coefficient R can vary from 1 to -1 depending on
the data.

- R = 0 is no linear correlation
- R = 1 is perfect +ve linear correlation (+ve gradient)
- R = -1 is perfect -ve linear correlation (-ve gradient)
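Pearson's product-moment coefficient can be computed directly from the sums that appear in its definition. The sketch below assumes a trajectory is given as two equal-length lists of coordinates; the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two sequences."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    # The denominator is zero when either coordinate is constant;
    # that case is not handled in this sketch.
    return numerator / denominator
```

Points lying exactly on a rising line give a value of 1 up to rounding, and points on a falling line give -1.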

From this measure, we calculated the deviation of the path from the obtained
regression line. This would mean that if the tracked object follows a normal path
trajectory, the R value will be relatively close to 1 or -1, and if the tracked object
follows an abnormal path trajectory, R will be closer to 0. Thirty normal
path trajectories have been analysed and the correlation coefficients are shown
in Table 2. From Table 2, it can be seen that R lies in the range
0.53 < R < 0.91 for positive correlation and −0.82 < R < −0.52 for negative
correlation. The same approach is used to analyse nine abnormal path trajectories,
and the results are shown in Table 3. The R ranges for abnormal path trajectories
are 0.22 < R < 0.85 and −0.87 < R < −0.12. However, these results do not give any
significant correlation value to distinguish between normal and abnormal path
trajectories, as the correlation ranges for abnormal paths overlap with the
normal path correlation ranges.

Table 2. Correlation Coefficient, R for Each Normal Trajectory Dataset

Dataset, S_i   R          Dataset, S_i   R          Dataset, S_i   R
1              0.7703     11             0.5913     21             -0.4666
2              0.8464     12             0.7884     22             -0.6628
3              0.8592     13             0.6049     23             -0.7293
4              0.9105     14             -0.8239    24             -0.6131
5              0.8426     15             -0.7176    25             -0.6285
6              0.8128     16             -0.5652    26             -0.5104
7              0.8181     17             -0.6895    27             0.9119
8              0.7953     18             -0.6767    28             0.7566
9              0.5371     19             -0.5910    29             -0.6651
10             0.6832     20             -0.6949    30             -0.7339

Table 3. Correlation Coefficient, R for Each Abnormal Trajectory Dataset

Dataset, S_i   R          Dataset, S_i   R          Dataset, S_i   R
1              0.2212     4              0.8550     7              -0.6886
2              -0.7137    5              -0.8695    8              0.4593
3              -0.6278    6              -0.7896    9              -0.1164

5  Concluding  Remarks 

In this paper, we presented a hybrid adaptive piecewise linear and fuzzy rule-based
algorithm for anomalous trajectory detection, and experimental results on various
challenging trajectories have validated the proposed method. Our aim in this
presentation has been to motivate the need for, and challenges involved in, the
detection of anomalous temporal data resulting from object tracking.
The proposed algorithm improves on state-of-the-art methods in
that 1) no extensive training and learning are required and 2) the anomaly
detection is performed as the incoming data are acquired, thereby avoiding the
classical two-step approach (data collection and off-line processing). Our future
work will focus on automatically extracting the rules explaining the phenomena
hidden in the input data for trajectory analysis, and on introducing the
interactions between objects into the trajectory patterns.

Abstract. Manifold Learning has attracted much attention in the past
decade. One of the main features of Manifold Learning is that it
tries to conserve local topologies in high-dimensional space. In
this paper, we discuss the effect of dimensionality reduction of the input
spaces of Evolutionary Learning. We examine two Manifold Learning
algorithms: Isomap and LLE. We adopt Instance-Based Policy Optimization
as the Evolutionary Learner. In addition, we introduce a metric
of relative error of distances between the original input space and the reduced
space. We show the relationship between this metric and the number
of neighbors in Manifold Learning.


1  Introduction 

In this study, we investigate the effect of dimensionality reduction in
Evolutionary Learning. In evolutionary learning, the allocation of sensors is a key
issue in designing effective intelligent agents/robots. It is impossible to solve
problems with insufficient sensor information, while redundant sensory inputs
cause a considerable increase in learning time.

In this paper, Isomap or LLE (Locally Linear Embedding), both Manifold
Learning algorithms, is used to reduce the dimensionality of sensory
inputs [1,2]. Using the reduced inputs, agents decide their actions and learn
policies to achieve a given task. An important feature of Manifold Learning
algorithms is that they preserve local topological relationships among data.
Fig. 1, for instance, depicts the S-shaped data which is often used to explain the
effectiveness of Manifold Learning. The left graph in this figure shows the original
data in a three-dimensional space, which are sampled from a two-dimensional
manifold. Note that the colors of points have no special meaning; they are just for
ease of understanding. The right graph in the figure is a typical result of Manifold
Learning for the original data. The order of the color sequence is maintained in the
resulting two-dimensional data.

Fig. 1. S-shaped data (LEFT) and the typical result by Manifold Learning (RIGHT)

In this paper, we propose a two-stage learning method for mobile robots.
The first stage is to learn the mapping from high-dimensional sensory inputs to
low-dimensional data. The high-dimensional sensory inputs are collected by
exploration in a separate environment containing various kinds of obstacles.
Manifold Learning is used to generate the low-dimensional data. The
second stage is to learn the policy of robots by using Evolutionary Algorithms.
At every time step, the robots perceive high-dimensional sensory inputs
in the same way as in the first stage. Then, the low-dimensional data associated
with the perceived high-dimensional data is given to an individual in the
Evolutionary Algorithm. This paper examines various combinations of parameters
for IBP (Instance-Based Policy) learning with dimension reduction algorithms.
In particular, we investigate the relationship between relative errors in dimension
reduction and the number of neighbors k. In addition, we compare the proposed
method with evolutionary learning with hand-tuned sensors.

Related work is as follows. Dimension reduction techniques including
SOM are often used in the conventional reinforcement learning community,
and as genetic operations or visualization tools for individuals in Evolutionary
Optimization [3,4,5,6,7]. In the case of Evolutionary Learning, there has been
little research. We can guess some reasons for this. One of the main streams of
applying Evolutionary Learning to robotics is Learning Classifier Systems (LCS) [8].
In LCS, schemata are a quite important notion. If we use dimension
reduction techniques, it would be difficult to constitute effective schemata.
Another evolutionary approach is the use of Neural Networks, i.e., NeuroEvolution
[9]. In this case, one would rely on the information processing capability of
Neural Networks for non-linear phenomena. In robotics, Manifold Learning has
attracted much attention for generating maps [10]. Our research can be regarded
as an extension of this study to Evolutionary Learning.

2  Manifold  Learning 

The first generation of Manifold Learning algorithms, i.e., Locally Linear
Embedding and Isomap, was proposed in 2000 [1,2,11]. These have attracted much
attention, especially in the image processing community, since they can embed the
relationships among a large number of images into a two-dimensional space
naturally. Hence, several subsequent algorithms have been proposed, such as
Laplacian Eigenmaps, Hessian Eigenmaps, and so on. In this paper, in order to
investigate the effectiveness of information processing on manifolds, we employ
basic Manifold Learning methods, i.e., Isomap and LLE.

2.1  Locally  Linear  Embedding 

The LLE algorithm tries to maintain the local topology in the reduced space. As
described below, the LLE algorithm is based on linear algebra for calculating the
positions in the reduced space, yet it can achieve highly nonlinear embeddings.
The LLE algorithm is executed as follows [11]:

1. Assign neighbors to each data point x_i.

2. Compute the weights w_ij that best linearly reconstruct x_i from its neighbors,
by minimizing the following cost:

\[ \varepsilon(W) = \sum_i \Big| x_i - \sum_j w_{ij} x_j \Big|^2 \]

3. Compute the low-dimensional embedding vectors y_i by using the weights
w_ij and minimizing the following cost:

\[ \Phi(Y) = \sum_i \Big| y_i - \sum_j w_{ij} y_j \Big|^2 \]

2.2  Isomap 

Isomap, proposed by Tenenbaum et al., is one of the most famous Manifold
Learning algorithms [1]. In Isomap, the geodesic distance on manifolds is
used instead of the Euclidean distance. The procedure of Isomap is
as follows:

1. The K-Nearest Neighbor method is applied to all the input data x_i. Then, a
neighborhood graph G is constructed such that nodes in the graph are connected if
they are neighbors in the sense of the K-Nearest Neighbor method. The distance
d_G(i, j) of an edge between connected nodes is set to d_X(i, j), i.e., the Euclidean
distance between the input data x_i and x_j.

2. For all pairs x_i, x_j of input data, the shortest path distance d_G(i, j) on
the neighborhood graph G is calculated.

3. A low-dimensional projection is generated by applying metric MDS (Multi
Dimensional Scaling) to the shortest path distances d_G(i, j).
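Steps 1 and 2 of the procedure above can be sketched in a few lines. This is a minimal illustration, not the reference Isomap implementation: it assumes the input is a small list of coordinate tuples, symmetrises the K-nearest-neighbor graph (an assumption), and uses the Floyd-Warshall algorithm for the shortest paths. The MDS step is omitted, and the names `euclidean` and `geodesic_distances` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance d_X between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def geodesic_distances(points, k):
    """Shortest-path distances d_G on a symmetrised k-nearest-neighbor graph."""
    n = len(points)
    inf = float("inf")
    dist = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    # Step 1: connect every point to its k nearest neighbors.
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: euclidean(points[i], points[j]))
        for j in others[:k]:
            d = euclidean(points[i], points[j])
            dist[i][j] = min(dist[i][j], d)
            dist[j][i] = min(dist[j][i], d)
    # Step 2: all-pairs shortest paths (Floyd-Warshall, O(n^3)).
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist
```

For three points forming a right angle, the geodesic distance between the two end points follows the corner and is therefore longer than their Euclidean distance.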

3  Instance-Based  Policy  Learning 

The instance-based policy learning proposed by Miyamae is an evolutionary
approach for solving reinforcement learning problems [12]. A policy is composed of
several vectors, called instances. Each instance consists of a state part and an
action part. For a given perceptual input at each time step, the nearest instance
is activated, as in the nearest neighbor method. The action of the activated
instance is taken. Fig. 2 depicts an example of instances in perceptual input space,
where the dimensions of states and actions are 2 and 1, respectively. The positions
of circles and the orientations of arrows denote the state part and the action part
of instances, respectively. As delineated in the figure, the perceptual input space
is segmented into several subspaces as in Voronoi diagrams. Each subspace is
associated with one instance. That is, each instance is activated for perceptual
inputs in its corresponding subspace. The arrows in the figure illustrate actions
for corresponding instances.

Fig. 2. Example of an Individual: Instances in perceptual input space

instance 1: [ s_11 ... s_1ns | a_11 ... a_1na ]
instance 2: [ s_21 ... s_2ns | a_21 ... a_2na ]
...
instance I: [ s_I1 ... s_Ins | a_I1 ... a_Ina ]

Fig. 3. Representation of Individuals in Instance Based Policy learning
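The nearest-instance lookup described above can be sketched as follows. The sketch assumes an individual is a flat list of real values in which each instance occupies n_s state values followed by n_a action values; the function names `decode` and `act` are hypothetical and are not taken from the IBP literature.

```python
import math


def decode(genotype, n_s, n_a):
    """Split a flat genotype into (state part, action part) instances."""
    width = n_s + n_a
    if len(genotype) % width != 0:
        raise ValueError("genotype length must be a multiple of n_s + n_a")
    return [(genotype[p:p + n_s], genotype[p + n_s:p + width])
            for p in range(0, len(genotype), width)]


def act(instances, perceptual_input):
    """Return the action part of the instance nearest to the input."""
    _state, action = min(
        instances, key=lambda inst: math.dist(inst[0], perceptual_input))
    return action


# Two instances with a 2-dimensional state part and a 1-dimensional action part.
policy = decode([0.0, 0.0, 0.1, 1.0, 1.0, 0.9], n_s=2, n_a=1)
```

Because the nearest instance always wins, the input space is implicitly partitioned into Voronoi cells, one per instance.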

Fig. 3 describes the genotype for Instance Based Policy learning. s_ij and
a_ik denote the j-th element of the state vector and the k-th element of the action
vector of the i-th instance. I indicates the number of instances, which is
predefined. n_s and n_a represent the numbers of states and actions, respectively.
All the variables s_ij and a_ik are represented by real values. Hence, any
Evolutionary Algorithm for continuous function optimization problems can be used.
This paper utilizes CMA-ES (Covariance Matrix Adaptation Evolution Strategy),
while the original paper on the IBP learning method uses the Real-Coded GA
proposed by their research group for evolution [12,13]. The reason for this choice
is the availability of the source code. We believe there is no significant
difference between CMA-ES and the Real-Coded GA, since we do not have to
find optimal policies with a high degree of precision as in ordinary function
optimization problems.

4 Proposed Method

A two-stage learning method for mobile robots is proposed in this paper, as
depicted in Fig. 4. The first stage is to constitute a mapping from sensory inputs
x to low-dimensional data y. The second stage is Evolutionary Learning to
achieve a given task. In this stage, at every time step the sensory input x_t is
transformed to the corresponding input y_t by using the mapping. Hence, the input
for the learner is y_t.


Fig. 4. Diagram of the proposed method


4.1  Constitution  of  Mapping 

First, data collection is carried out: a robot moves around in a given environment.
In this paper, we set up a separate environment for this exploration, in which there
are a variety of obstacles. After a large number of sensory inputs are gathered,
data with no activated sensors are eliminated. Then, a predefined number
of data is randomly chosen from the remaining data set.

The dimension reduction method is carried out for the chosen data. This paper
examines not only Isomap but also Kernel PCA algorithms for this purpose [14].
The chosen data x_i and the reduced data y_i are associated, where i = 1, ..., n_d,
and n_d indicates the number of the chosen data.

4.2  Transformation  of  Sensory  Inputs  in  Evolutionary  Learning 
Phase 

As mentioned above, at every time step t, the sensory input x_t should be
transformed. First, the nearest and the second nearest points x', x'' to x_t are
found among the chosen data. Second, the current sensory input x_t is projected
onto the line defined by the two points x', x''. The projected point x* is expressed
as the relative position α on the line, i.e., x* = x' + α(x'' − x'), as delineated in
Fig. 5:

\[ \alpha = \frac{x^* - x'}{x'' - x'} \]

The input y_t for individuals is defined as follows:

\[ y_t = \alpha (y'' - y') + y' \]

where y' and y'' are the points in the reduced space which are associated with x'
and x'', respectively.

Fig. 5. Transformation of sensory inputs

In the case that the dimension of sensory inputs is high and the number of
chosen data is large, it takes much time to find the nearest and the
second nearest data x', x''. Locality-Sensitive Hashing (LSH) is used for finding
such nearest points efficiently [15].
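The transformation of a sensory input into the reduced space can be sketched as below. The sketch assumes the two nearest chosen points and their reduced counterparts have already been found, and it computes the relative position α with a dot product, which is one standard way to project a point onto a line; the function name `project_to_reduced` and the argument names are hypothetical.

```python
def project_to_reduced(x_t, x1, x2, y1, y2):
    """Project x_t onto the line through x1 and x2, then interpolate y1 and y2.

    x1 and x2 are the nearest and second nearest chosen data, and y1 and y2
    are their associated points in the reduced space.
    """
    direction = [b - a for a, b in zip(x1, x2)]
    offset = [c - a for a, c in zip(x1, x_t)]
    norm_sq = sum(d * d for d in direction)
    # alpha is the relative position of the projected point on the line;
    # norm_sq is zero if x1 and x2 coincide, which is not handled here.
    alpha = sum(o * d for o, d in zip(offset, direction)) / norm_sq
    y_t = [p + alpha * (q - p) for p, q in zip(y1, y2)]
    return alpha, y_t
```

A value of α between 0 and 1 means the projection falls between the two chosen points; values outside that range extrapolate along the same line.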

5  Experiments 

5.1  Configuration  of  Robots 

We employ Simbad, a Java 3D robot simulator, for constructing the simulated
environment [16]. The mobile robot used in this paper is described as follows. The
radius of the robot is 0.3 meters. 20 time steps per second are simulated. The
robot has a large number of sonar sensors. We examined 72 or 12 sonar sensors
for the proposed method. The allocation of these sensors follows the same scheme:
the first sensor is set at the front of the robot, and the remaining sensors are
allocated equiangularly, i.e., at every 5 degrees for 72 sonar sensors. The range of
the sensors is 1.5 meters. The robot goes forward at 0.5 meters per second if there
is no activated sensor. Otherwise, the translational and the rotational velocities of
the robot are set to 0.2 meters per second and (a − 0.5) × π/2 radians per second,
respectively, where a denotes the action of the agent. The action a in this paper
varies continuously from 0 to 1.

5.2  Dimension  Reduction 

Fig. 6 shows the simulated environment for collecting a variety of sensory
inputs. The size of the field is 18 meters × 18 meters. Walls and blocks of various
sizes are placed in it. A robot with 72 sonar sensors moves around in this field
for a sufficient time. As mentioned in the previous section, data for dimension
reduction is randomly chosen. The number of chosen data n_d is set to 1000.
We generate 10 datasets by using different random seeds for this choice.
From these 1000 data for 72 sonar sensors, we generate another kind of dataset by
neglecting certain sensor values, i.e., data for 12 sonar sensors.

Several mappings are generated by using Isomap and LLE. As mentioned
above, we now have 2 kinds of datasets, where each dataset is composed of 10
sub-datasets with different random seeds. For each dataset, the dimension reduction
method is carried out. The dimensions of the reduced space are set to 2, 3, 4, 5,
7, 10 and 15. In the case of the dataset for 12 sonar sensors, we did not apply
the 15 dimensions of reduced space. In addition, we examined various numbers
of neighbors k for Isomap and LLE: k = 5, 10, 15, 20, and 30.

Fig. 6. Simulated environment for collecting a variety of sensory inputs

Fig. 7. Relative errors for various pairs of the dimension of the reduced space and
the number of neighbors: the upper and lower graphs are for 12 and 72 sonar
sensors, respectively; LLE (LEFT) and Isomap (RIGHT)

We introduce the relative error to evaluate the reduced space. This relative
error is calculated as follows:

\[ \text{relative error} = \frac{1}{n_d (n_d + 1)/2} \sum_{i} \sum_{j > i} \frac{\left| D_G(x_i, x_j) - D_E(y_i, y_j) \right|}{D_G(x_i, x_j)} \]

where D_G(·,·) and D_E(·,·) indicate the geodesic distance in the original space,
estimated by Isomap, and the Euclidean distance in the reduced space, respectively.
Note that the LLE algorithm does not use the geodesic distance at all. However,
LLE uses the notion of neighbors, so this metric is also useful for LLE.
Fig. 7 depicts the relative errors for various pairs of the dimension of the
reduced space and the number of neighbors k. These values are averaged over
10 datasets which are randomly chosen with various random seeds. As the
number of neighbors increases, the corresponding relative errors decrease. In
addition, as the dimension of the reduced space increases, the corresponding
relative errors also decrease.
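A relative error of this kind can be sketched from two precomputed distance matrices. The exact index range and normalisation of the printed equation are not fully legible, so the sketch below assumes a sum over pairs i < j divided by n(n + 1)/2; the function name `relative_error` and the matrix arguments are illustrative.

```python
def relative_error(d_geodesic, d_reduced):
    """Average relative gap between geodesic and reduced-space distances.

    Both arguments are square matrices (lists of lists) over the same points.
    The pair range i < j and the divisor n(n + 1)/2 are assumptions.
    """
    n = len(d_geodesic)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(d_geodesic[i][j] - d_reduced[i][j]) / d_geodesic[i][j]
    return total / (n * (n + 1) / 2)
```

An embedding that reproduces every geodesic distance exactly has a relative error of zero, and the value grows as the reduced-space distances drift away from the geodesic ones.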

5.3  Experimental  Results 

We employ all the combinations of (algorithm, the number of sonar sensors,
the dimension of reduced space, the number of neighbors k) as indicated in
the previous subsection. The algorithm is either LLE or Isomap. 12 or 72
sonar sensors are examined. These algorithms reduce the dimension of the original
input space, which is equal to the number of sonar sensors, to a 2, 3, 4,
5, 7, 10, or 15 dimensional space. For 12 sonar sensors, the reduction to a 15
dimensional space is not carried out. k is set to 5, 10, 15, 20 or 30.
For comparison, Kernel PCA is also examined. Kernel PCA does not use
the notion of neighbors, so, except for k, similar combinations of parameters
to those mentioned above are examined.

Two simulated environments are examined, as depicted in Fig. 8. In these
depictions, a robot is located at the initial position. The goal area is located at
the red line on the left side of these figures. 500 seconds (equivalent to 10,000
steps) are allowed for a single examination. The episode is terminated
if the robot reaches the goal, the robot bumps into an obstacle, or 500 seconds are
exceeded.

The evaluation of a single examination is calculated as follows. The following
function e is applied if the robot reaches the goal:

e = 0.1 + 1.0/(No. steps)

The second term on the right side of this equation is a very small number in
comparison with the first term, i.e., 0.1. Therefore, the evaluation for success is
almost 0.1, but it is greater if the robot reaches the goal faster. Otherwise,
the evaluation is as follows:

e = −1.0/(No. steps) × (distance to the goal)

This evaluation is a very small negative number. The evaluation is worse if the
robot bumps promptly or the robot could not get close to the goal. For a
single fitness evaluation, five examinations are carried out. The fitness
of an individual is calculated as the sum of the five evaluations.
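The evaluation of a single examination and the resulting fitness can be written down directly from the two formulas above. The sketch assumes each examination is summarised by whether the goal was reached, the number of steps taken, and the remaining distance to the goal; the function names `evaluate` and `fitness` are illustrative.

```python
def evaluate(reached_goal, steps, distance_to_goal=0.0):
    """Evaluation e of a single examination."""
    if reached_goal:
        return 0.1 + 1.0 / steps
    return -1.0 / steps * distance_to_goal


def fitness(examinations):
    """Fitness of an individual: the sum of its examination evaluations."""
    return sum(evaluate(*examination) for examination in examinations)


# Five examinations, as in the experimental setup: four successes, one failure.
example = fitness([(True, 1000), (True, 1200), (True, 900),
                   (True, 1100), (False, 200, 3.0)])
```

Since a success contributes slightly more than 0.1 and a failure a small negative value, the fitness is roughly 0.1 times the number of successful examinations.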

The number of instances in the Instance Based Policy optimization is set
to 5. n_s is the same as the dimension of the reduced space. n_a is 1, i.e., the
action a in the previous subsection. The length of individuals is (n_s + n_a) ×
(the number of instances).

Fig. 8. Simulated Environments

Fig. 9. Experimental results: Averaged Fitness after evolution; Simulated environment
with no obstacles; (LEFT: the number of sensors is 12; RIGHT: 72)

Fig. 9 shows the experimental results for the simplest simulated environment,
i.e., the left picture in Fig. 8. The left and right graphs are for
12 and 72 sonar sensors, respectively. The horizontal axis denotes the averaged
fitness values after evolution for the corresponding combinations of (algorithm,
the number of neighbors k, the dimension of the reduced space). The best
performances among them are (Isomap, 20, 2) for 12 sonar sensors, and (Isomap, 30,
2) for 72 sonar sensors. For the same algorithm and dimension
of the reduced space, the performance tends to increase as k increases. This
tendency is similar to that of the relative error shown in subsection 5.2.

Fig. 10 shows the experimental results for the simulated environment with one
obstacle, i.e., the right picture in Fig. 8. It is difficult to see the tendency as
clearly as in the result for the simulated environment without obstacles. However,
in the case of Isomap, larger k gives better results. LLE does not work well for
this environment. For 72 sonar sensors, performance deteriorates for all the
algorithms if the dimension of the reduced space is around 5. As we can see in
Fig. 7, the relative error improves as the dimension of the reduced space increases.
However, such an increase causes performance deterioration in the Evolutionary
Learning phase. In the case of this simulated environment with 72 sonar sensors,
there appears to be a suboptimum where the dimension of the reduced space is 10.

Fig. 10. Experimental results: Averaged Fitness after evolution; Simulated environment
with one obstacle; (LEFT: the number of sensors is 12; RIGHT: 72)

Fig. 11. Hand-tuned sensor configurations (Config. 1, Config. 2, Config. 3) for the
conventional IBP method

Fig. 12. Experimental results: The changes of fitness by IBP with hand-tuned sensors:
Simulated environments without obstacles (LEFT); with one obstacle (RIGHT)

Finally, we compare the proposed methods with IBP with hand-tuned sensor
allocations. We show here three sensor configurations, as shown in Fig. 11.
Fig. 12 shows the experimental results of this comparison. The left and right graphs
show the results for the simulated environment without obstacles and with one
obstacle, respectively. In the simple environment, IBP with hand-tuned sensors
can acquire the optimal policy rapidly. On the other hand, IBP with configurations
1 and 3 cannot solve the simulated environment with one obstacle well.
IBP with configuration 2 works well, while its performance is worse than that
of Isomap and Kernel PCA. These configurations may not be optimal, since
we examined only just over 20 configurations. These results, however, illustrate the
difficulty of allocating sensors for general purposes.

6  Conclusions 

In this paper, we examined various combinations of parameters for IBP with
dimension reduction algorithms: two kinds of Manifold Learning algorithms, i.e.,
Isomap and LLE; the number of neighbors k; the number of sensors, i.e., the
dimension of the original input space; and the dimension of the reduced space.
We introduced the relative error to investigate how well the dimension reduction
worked. With respect to the number of neighbors, the fitness tends to improve as
the relative error decreases. However, no such tendency could be observed with
respect to the dimension of the reduced space: the relative error decreases as
the dimension of the reduced space increases, but at the same time the
performance deteriorates. One reason for this is that the length of the
individuals grows in proportion to the dimension of the reduced space.

In addition, we compared the proposed method with IBP with hand-tuned sensors.
This experiment reveals the difficulty of allocating several sensors for general
purposes. The proposed method can avoid this difficulty efficiently.

Future work is as follows. The proposed method is a two-stage algorithm, that
is, a batch process is adopted. For practical applications, it would be better
to apply an on-line version of Manifold Learning. In that case, the meanings of
the input values could be changed by Manifold Learning during evolution, so some
isomorphism mechanism should be devised. We may also be able to incorporate the
geodesic distance into Evolutionary Learning instead of using Manifold
Learning. In that case, we need to take the curse of dimensionality into account.

Abstract. We propose a new approach for a real-time personal authentication
system, which consists of a selective face attention model, incremental feature
extraction, and an incremental neural classifier model with long-term memory.
In this paper, a face-color preferable selective attention combined with the
AdaBoost algorithm is used to detect human faces, and incremental principal
component analysis (IPCA) and resource allocating network with long-term
memory (RAN-LTM) are effectively combined to implement real-time personal
authentication systems. The biologically motivated face-color preferable
selective attention model localizes face candidate regions in a natural scene, and
then the AdaBoost based face detection process identifies human faces from the
localized face-candidate regions. IPCA updates an eigenspace incrementally
by rotating eigen-axes and adaptively increasing the eigenspace dimensions.
The features extracted by projecting inputs to the eigenspace are given to
RAN-LTM, which learns facial features incrementally without unexpected
forgetting and recognizes faces in real time. The experimental results show that
the proposed model successfully recognizes 200 human faces through
incremental learning without serious forgetting.

Keywords:  person  authentication,  face  detection,  selective  attention,  saliency 
map,  incremental  learning,  principal  component  analysis,  RBF  networks. 


1  Introduction 

Recently, biometric features have been widely used as a means to authenticate
a user's identity. Various biometric features have been considered to represent
a user's characteristics, such as fingerprints, iris patterns, facial features,
and hand silhouettes, each of which has its own merits and demerits for real-world
applications. Among authentication schemes using facial biometric features, the
eigen-face approach, in which eigenvectors are computed to transform face image
data into low-dimensional features, is widely adopted for face recognition
systems. Biometrics using facial information is one of the promising approaches
to implementing a reliable system for personal authentication.

However, one of the difficulties in implementing a facial feature based
authentication system is to enhance its robustness against the spatial and temporal
variations of human faces due to growth (or aging) and changes in lighting
conditions, face directions, expressions, make-up, and so forth. Conventional
personal authentication systems can achieve excellent performance when tested on a
benchmark dataset. However, their performance can drop rather drastically when
they are operated in a practical environment. This is because the training set of
face images will be either insufficient or inappropriate for future events.

Even if a large number of face images are available during the construction of a
personal authentication system, it is unlikely that all the variations that will
happen in the future could be considered in advance; thus reliable performance of
the authentication system in practical situations can hardly be expected with only
a static dataset. In this paper, as a solution to this problem, we propose a new
personal authentication system that can learn continuously to adapt to incoming new
training human faces. This can be done by embedding an incremental learning ability
in both the feature extraction part and the classification part.

This paper is organized as follows. Section 2 describes the proposed incremental
personal authentication system, which consists of the bottom-up face detection
using face color preferable attention for selecting face candidate areas [1], the
incremental learning of the feature extraction part using incremental principal
component analysis (IPCA) [2, 3], and the incremental learning of a neural classifier
called resource allocating network with long-term memory (RAN-LTM) [3]. The
experimental results follow in Section 3. Section 4 presents our conclusions
and discussions.


2  Incremental  Personal  Authentication  System 

Figure 1 shows the proposed incremental personal authentication system. First, we
simply consider a skin color preferable attention model for face color perception and
Haar-like form features for face form perception, in which all processes work in real
time [1, 4].

A biologically motivated selective attention model with face-color preference can
decide face candidate areas in a complex input scene. For the selected face candidate
regions, an AdaBoost algorithm [4] using Haar-like form features is applied to
localize human faces, not in all regions of the input scene but only in the
face candidate areas obtained by the face color preferable selective attention model.
Thus, we first use a face candidate localizer based on the biologically motivated
bottom-up saliency map (SM) model [5]. Second, we adopt IPCA for facial feature
extraction conducted in an online way [2, 3]. Finally, we introduce a neural
classifier called RAN-LTM, which learns facial features incrementally without
unexpected forgetting and recognizes faces using eigen-features obtained by IPCA [3].
The detailed processing in each part is described below.

Fig. 1. The block diagram of processing in the incremental personal authentication system

2.1  Selective  Attention  Model  with  AdaBoost  for  Human  Face  Detection 

In order to implement a human-like efficient visual selective attention function, we
consider the bottom-up saliency map (SM) model proposed in [6]. The SM model
reflects the functions of the retina cells, the lateral geniculate nucleus (LGN) and the
visual cortex. Since the retina cells can extract edge and intensity information as well
as color opponency, we use these factors as the basic features of the SM model [6-8].
In order to take the face color preference property into consideration, the skin color
filtered [9] intensity feature is considered together with the original intensity feature.
Depending on the task to be conducted, those two intensity features are biased
differently. For face preferable attention, the skin color filtered intensity feature
works as a dominant feature in generating an intensity feature map. The real
color components red (R), green (G), blue (B), and yellow (Y) are extracted using
normalized color coding [7]. According to our experiments, the real color component
R among the four real color components makes the dominant contribution to face color
plausible filtering. Moreover, RG color opponent coding features also show a
discriminative characteristic between face and non-face areas. Therefore, in the
proposed model, only the real color component R and the RG color opponent features
are considered to generate a skin color filter, which reduces computation time and
also gives better skin color filtering performance.

Considering the function of the LGN and the ganglion cells, we implement the
on-center and off-surround operation using Gaussian pyramid images with different
scales from level 0 to level n, where level n is obtained by subsampling by a
factor of $2^n$. This makes it possible to construct four feature bases: the
intensity (I), the edge (E), and color (RG and BY) [6, 8]. This reflects the
non-uniform distribution of the retinotopic structure. Then, the center-surround
mechanism is implemented in the model as the difference operation between the fine
and coarse scales of the Gaussian pyramid images [6, 8]. Consequently, three
feature maps are obtained by the following equations.

$$
\begin{aligned}
I(c,s) &= |I(c) \ominus I(s)| \\
E(c,s) &= |E(c) \ominus E(s)| \\
RG(c,s) &= |R(c) \ominus G(c)| - |G(s) \ominus R(s)|
\end{aligned}
\tag{1}
$$

where $\ominus$ represents interpolation to the finer scale followed by point-by-point
subtraction, and c and s are the indexes of the finer scale and the coarse scale,
respectively. The features are combined into three feature maps as shown in Eq. (2),
where $\bar{I}$, $\bar{E}$ and $\bar{C}$ stand for the intensity, edge, and color
feature maps, respectively. These are obtained through across-scale addition
"$\oplus$" [6].


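$$
\begin{aligned}
\bar{I} &= \bigoplus_{c=2}^{3} \bigoplus_{s=c+2}^{c+3} N(I(c,s)) \\
\bar{E} &= \bigoplus_{c=2}^{3} \bigoplus_{s=c+2}^{c+3} N(E(c,s)) \\
\bar{C} &= \bigoplus_{c=2}^{3} \bigoplus_{s=c+2}^{c+3} N(RG(c,s))
\end{aligned}
\tag{2}
$$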

Thus, the three feature maps $\bar{I}$, $\bar{E}$ and $\bar{C}$ are obtained by the
center-surround difference and normalization (CSD&N) algorithm [6]. An SM is
generated by the summation of these three feature maps.
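The center-surround difference of Eq. (1) and the across-scale addition of Eq. (2) can be sketched for a single feature channel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 5-tap binomial blur, the nearest-neighbor interpolation, the min-max stand-in for the normalization N(.), and all function names are choices made for this example, and the input is assumed to be a square image whose side is a multiple of 2^6.

```python
import numpy as np

# 5-tap binomial kernel used as a cheap Gaussian approximation (assumption).
KERNEL = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0

def blur(img):
    """Separable blur with edge padding; output has the same shape as img."""
    padded = np.pad(img, 2, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, KERNEL, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, KERNEL, mode="valid"), 0, rows)

def pyramid(img, levels):
    """Gaussian pyramid: level n is the input blurred and subsampled by 2**n."""
    pyr = [img.astype(float)]
    for _ in range(levels):
        pyr.append(blur(pyr[-1])[::2, ::2])
    return pyr

def upsample(img, shape):
    """Nearest-neighbor interpolation to a finer scale (assumption)."""
    fy, fx = shape[0] // img.shape[0], shape[1] // img.shape[1]
    return np.repeat(np.repeat(img, fy, axis=0), fx, axis=1)

def center_surround(pyr, c, s):
    """|F(c) (-) F(s)|: interpolate the coarse scale s to scale c, subtract pointwise."""
    return np.abs(pyr[c] - upsample(pyr[s], pyr[c].shape))

def normalize(m):
    """Min-max scaling to [0, 1]; a simple stand-in for N(.), not the operator of [6]."""
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def feature_map(pyr, centers=(2, 3), deltas=(2, 3)):
    """Across-scale addition of the normalized center-surround maps, as in Eq. (2)."""
    target = pyr[centers[0]].shape
    total = np.zeros(target)
    for c in centers:
        for delta in deltas:
            total += upsample(normalize(center_surround(pyr, c, c + delta)), target)
    return total

# Usage: a bright square on a dark background as a toy intensity channel.
image = np.zeros((256, 256))
image[96:160, 96:160] = 1.0
intensity_map = feature_map(pyramid(image, levels=6))
```

Under the same assumptions, running `feature_map` on the edge and RG channels and summing the three results would give a toy saliency map.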

The salient areas are obtained by selecting areas with relatively higher saliency in
the SM. In order to decide the salient areas, the proposed model generates binary
data for each selected face candidate area using Otsu's threshold method [10] in the
SM. Then, the proposed model makes a group of segmented areas using a labeling method
for each binary face candidate area. After the candidate salient areas for human
faces are obtained, they are used as input to the AdaBoost algorithm [4]. We adopted
an AdaBoost approach using simple Haar-like features as the face detection algorithm
for correctly localizing faces in the face candidate regions selected by the
face-color preferable SM model [1]. There are two datasets for face feature
extraction and learning for the AdaBoost model. One is called a positive dataset, in
which every image has a face. The other, a set of non-face images, is called a
negative dataset. For the two datasets, Haar-like features are extracted in order to
select the proper features and train the AdaBoost face detection model. Figure 2
shows the proposed selective attention model for human face detection.

Fig. 2. The proposed selective attention model for human face detection. r: red,
g: green, b: blue, R: real red, G: real green, I: intensity, E: edge, RG: red-green
opponent coding, CSD&N: center-surround difference & normalization, $\bar{I}$:
intensity feature map, $\bar{E}$: edge feature map, $\bar{C}$: color feature map,
SM: saliency map
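The binarization of the saliency map can be illustrated with a small NumPy version of Otsu's method [10], which picks the threshold that maximizes the between-class variance of the histogram. This is only a sketch: the 256-bin histogram and the function name are assumptions for illustration, and the labeling (connected-component) step that follows in the text is not shown.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the histogram bin center that maximizes the between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist)                  # number of samples at or below each bin
    w1 = w0[-1] - w0                      # number of samples above each bin
    s0 = np.cumsum(hist * centers)        # running sum of values in the lower class
    mu0 = s0 / np.maximum(w0, 1)          # lower-class mean (guard against empty class)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance up to a constant
    return centers[np.argmax(between)]

# Usage: a toy saliency map with a salient block on a weak background.
saliency = np.full((32, 32), 0.1)
saliency[8:16, 8:16] = 0.9
mask = saliency > otsu_threshold(saliency.ravel())
```

Here `mask` marks the candidate pixels; grouping them into regions would be the job of the labeling step.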

2.2  Incremental  Learning  of  Feature  Extraction  Using  IPCA 

In IPCA [2], an eigen-feature space is updated through two operations:
the rotation of eigen-axes and the dimensional augmentation. Assume that N
training samples $x_i \in R^n$ $(i = 1, \ldots, N)$ have been presented so far, and an
eigenspace model $\Omega = (\bar{x}, U, \Lambda, N)$ is constructed by calculating the
eigenvectors and eigenvalues from the covariance matrix of $x_i$, where $\bar{x}$ is
the mean input vector, U is an $n \times l$ matrix whose column vectors correspond to
the eigenvectors, and $\Lambda$ is an $l \times l$ matrix whose diagonal elements
correspond to the eigenvalues. Here, l is the number of dimensions of the current
eigenspace. Let us consider the case that the (N+1)th training sample y is presented.
The addition of y will lead to changes in both the mean vector and the covariance
matrix; therefore, the eigenvectors and eigenvalues should also be recalculated. The
mean input vector $\bar{x}$ is easily updated as follows:


$$
\bar{x}' = \frac{1}{N+1}\,(N\bar{x} + y). \tag{3}
$$


The problem is how to update the eigenvectors and eigenvalues. When the eigenspace
model $\Omega$ is reconstructed to adapt to a new sample, we must check whether the
dimensions of the eigenspace should change or not. If the new sample has almost all
of its energy in the current eigenspace, dimensional augmentation is not needed in
reconstructing the eigenspace. However, if it has some energy in the complementary
space to the current eigenspace, dimensional augmentation cannot be avoided.
This can be checked by the accumulation ratio, whose incremental representation is
given as follows:

$$
A(l) = \frac{N(N+1)\sum_{i=1}^{l}\lambda_i + N\,\|U^T(y-\bar{x})\|^2}
            {N(N+1)\sum_{i=1}^{n}\lambda_i + N\,\|y-\bar{x}\|^2}. \tag{4}
$$
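The accumulation ratio can be read as the fraction of the updated total variance that the current l-dimensional eigenspace still captures. The short derivation below makes this explicit. It assumes that the covariance matrix is normalized by 1/N (the normalization is not stated in the text) and writes C and C' for the covariance before and after y is added; both symbols are introduced here only for illustration.

```latex
\begin{align*}
C' &= \frac{N}{N+1}\,C + \frac{N}{(N+1)^2}\,(y-\bar{x})(y-\bar{x})^T, \\
\operatorname{tr}(U^T C' U) &= \frac{N}{N+1}\sum_{i=1}^{l}\lambda_i
      + \frac{N}{(N+1)^2}\,\|U^T(y-\bar{x})\|^2, \\
\operatorname{tr}(C') &= \frac{N}{N+1}\sum_{i=1}^{n}\lambda_i
      + \frac{N}{(N+1)^2}\,\|y-\bar{x}\|^2, \\
A(l) &= \frac{\operatorname{tr}(U^T C' U)}{\operatorname{tr}(C')}
      = \frac{N(N+1)\sum_{i=1}^{l}\lambda_i + N\,\|U^T(y-\bar{x})\|^2}
             {N(N+1)\sum_{i=1}^{n}\lambda_i + N\,\|y-\bar{x}\|^2}.
\end{align*}
```

The last step multiplies the numerator and the denominator by $(N+1)^2$. A value of A(l) close to 1 means that the new sample adds little variance outside the span of U.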

If $A(l)$ is smaller than a threshold value $\theta$, a new eigen-axis is added to
the current eigenspace along the residue vector h:

$$
h = (y - \bar{x}) - Ug, \tag{5}
$$

where

$$
g = U^T (y - \bar{x}). \tag{6}
$$


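It has been shown that the eigenvectors and eigenvalues can be updated based on the
solution of the following intermediate eigenproblem [11]:

$$
\left( \frac{N}{N+1}
\begin{bmatrix} \Lambda & 0 \\ 0^T & 0 \end{bmatrix}
+ \frac{N}{(N+1)^2}
\begin{bmatrix} g g^T & \gamma g \\ \gamma g^T & \gamma^2 \end{bmatrix}
\right) R = R \Lambda', \tag{7}
$$

where $\gamma = \hat{h}^T (y - \bar{x})$, R is an $(l+1) \times (l+1)$ matrix whose
column vectors are the eigenvectors obtained from the above intermediate eigenproblem,
$\Lambda'$ is the new eigenvalue matrix, and 0 is an l-dimensional zero vector. Using
R, we can obtain the new $n \times (l+1)$ eigenvector matrix $U'$ as follows:

$$
U' = [U, \hat{h}]\,R, \tag{8}
$$

where

$$
\hat{h} =
\begin{cases}
h / \|h\| & \text{if } A(l) < \theta \\
0 & \text{otherwise.}
\end{cases} \tag{9}
$$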


Here, $\theta$ is a threshold value. Figure 3 shows the general flow of the
incremental feature extraction using IPCA.
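One IPCA update step (mean update, accumulation-ratio check, intermediate eigenproblem, and rotation of the basis) can be sketched in NumPy as follows. This is a sketch under assumptions rather than the implementation of [2]: the covariance is taken to be normalized by 1/N, the total variance is tracked as a separate scalar so that the denominator of the accumulation ratio is available, the smaller l x l eigenproblem is solved when no axis is added, and the function name, the default threshold and the `eps` guard are hypothetical.

```python
import numpy as np

def ipca_update(mean, U, lam, N, y, total_var, theta=0.99, eps=1e-10):
    """Update the eigenspace model (mean, U, lam, N) with one new sample y.

    mean: (n,) mean vector, U: (n, l) eigenvectors, lam: (l,) eigenvalues,
    total_var: trace of the covariance (sum of all n eigenvalues), tracked
    separately by assumption.
    """
    d = y - mean
    g = U.T @ d                      # coordinates of y in the current eigenspace
    h = d - U @ g                    # residue vector outside the eigenspace

    # Accumulation ratio of the current eigenspace after seeing y.
    num = N * (N + 1) * lam.sum() + N * (g @ g)
    den = N * (N + 1) * total_var + N * (d @ d)
    augment = (num / den < theta) and (np.linalg.norm(h) > eps)

    l = len(lam)
    if augment:
        h_hat = h / np.linalg.norm(h)
        gamma = h_hat @ d
        basis = np.column_stack([U, h_hat])
        v = np.append(g, gamma)
        L = np.zeros((l + 1, l + 1))
        L[:l, :l] = np.diag(lam)
    else:
        # No new axis: solve the l x l problem (assumed handling of this case).
        basis, v, L = U, g, np.diag(lam)

    # Intermediate eigenproblem in the (possibly augmented) basis.
    M = (N / (N + 1)) * L + (N / (N + 1) ** 2) * np.outer(v, v)
    new_lam, R = np.linalg.eigh(M)
    order = np.argsort(new_lam)[::-1]        # largest eigenvalue first
    new_lam, R = new_lam[order], R[:, order]

    new_U = basis @ R                        # rotate the basis
    new_mean = (N * mean + y) / (N + 1)      # mean update
    new_total = (N / (N + 1)) * total_var + (N / (N + 1) ** 2) * (d @ d)
    return new_mean, new_U, new_lam, N + 1, new_total
```

When the current eigenspace holds all non-zero eigenvalues and the new axis is added, this update reproduces the batch covariance of the N+1 samples up to rounding, which gives a convenient check on small data. Once axes have been discarded the result is only an approximation.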

Fig. 3. The incremental feature extraction model: IPCA processing

2.3  Resource  Allocating  Network  with  Long-Term  Memory 

When training samples are given incrementally, neural networks often suffer from a
well-known phenomenon called catastrophic interference [12]. RAN-LTM can
alleviate this problem. Figure 4 shows the architecture of RAN-LTM, which consists
of two parts: Resource Allocating Network (RAN) [13] and Long-Term Memory
(LTM). RAN is an extended model of a Radial Basis Function (RBF) network in
which the allocation of hidden units is automatically carried out. Let us denote the
number of input units, hidden units, and output units as I, J, and K, respectively.

Moreover, let the inputs be $x = \{x_1, \ldots, x_I\}^T$, the outputs of the hidden
units be $y = \{y_1, \ldots, y_J\}^T$, and the outputs be
$z = \{z_1, \ldots, z_K\}^T$. The calculation in the forward direction is given as
follows:

$$
y_j = \exp\left(-\frac{\|x - c_j\|^2}{\sigma_j^2}\right) \quad (j = 1, \ldots, J), \tag{10}
$$

$$
z_k = \sum_{j=1}^{J} w_{kj}\, y_j + \xi_k \quad (k = 1, \ldots, K), \tag{11}
$$

where $c_j = \{c_{j1}, \ldots, c_{jI}\}^T$ and $\sigma_j^2$ are the center and
variance of the jth hidden unit, $w_{kj}$ is the connection weight from the jth hidden
unit to the kth output unit, and $\xi_k$ is the bias of the kth output unit. The items
stored in LTM are called ‘memory items’ that correspond to representative input-output
pairs. These pairs can be selected from training samples, and they are learned with
newly given training data to suppress forgetting. In the learning algorithm, a memory
item is created when a hidden unit is allocated: that is, an RBF center and the
corresponding output are stored as a memory item in the LTM. The learning algorithm of
RAN-LTM is divided into two phases: the allocation of hidden units (i.e., incremental
selection of RBF centers) and the calculation of connection weights between hidden and
output units. The procedure in the former phase is the same as that in the original
RAN, except that memory items are created at the same time. Once hidden units are
allocated, the centers are fixed afterwards. Therefore, the connection weights
$W = \{w_{kj}\}$ are the only parameters that are updated based on the output errors.
To minimize the errors based on the least squares method, it is well known that the
following linear equations should be solved [14]:


$$
\Phi W = D, \tag{12}
$$


where D is the matrix whose column vectors correspond to the target outputs.
Suppose that a training sample $(x, d)$ is given and M memory items
$(\tilde{x}_m, \tilde{z}_m)$ $(m = 1, \ldots, M)$ have already been created. Then the
target matrix D is formed as $D = \{d, \tilde{z}_1, \ldots, \tilde{z}_M\}^T$.
Furthermore, $\Phi = \{\phi_{ij}\}$ $(i = 1, \ldots, M+1)$ is calculated from the
training sample and the memory items as follows:

$$
\phi_{1j} = \exp\left(-\frac{\|x - c_j\|^2}{\sigma_j^2}\right), \quad
\phi_{i+1,j} = \exp\left(-\frac{\|\tilde{x}_i - c_j\|^2}{\sigma_j^2}\right)
\quad (j = 1, \ldots, J;\ i = 1, \ldots, M). \tag{13}
$$

To solve for W in Eq. (12), Singular Value Decomposition (SVD) can be used.
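The forward computation and the least-squares weight calculation described above can be sketched as follows. This is a minimal sketch under assumptions, not the RAN-LTM implementation: the function names are hypothetical, the bias term is omitted as in the linear system above, and W is stored with shape J x K (the transpose of the w_kj indexing) so that the product of the hidden-output matrix and W gives the targets. `np.linalg.pinv` computes the pseudo-inverse through an SVD.

```python
import numpy as np

def rbf_hidden(X, centers, sigma2):
    """Hidden-unit outputs for each row of X.

    X: (P, I) inputs, centers: (J, I) RBF centers, sigma2: (J,) variances.
    Returns a (P, J) matrix of Gaussian activations.
    """
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / sigma2)

def solve_weights(x, d, mem_x, mem_z, centers, sigma2):
    """Least-squares weights from one new sample (x, d) plus M memory items."""
    X = np.vstack([x, mem_x])             # (M+1, I): new sample first, then memory
    D = np.vstack([d, mem_z])             # (M+1, K): matching target outputs
    Phi = rbf_hidden(X, centers, sigma2)  # (M+1, J)
    W = np.linalg.pinv(Phi) @ D           # SVD-based least-squares solution, (J, K)
    return W, Phi, D

def rbf_output(X, centers, sigma2, W):
    """Network outputs without the bias term (assumption of this sketch)."""
    return rbf_hidden(X, centers, sigma2) @ W

# Usage: three hidden units; two memory items sit on the first two centers,
# and the new training sample sits on the third center.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sigma2 = np.ones(3)
W, Phi, D = solve_weights(centers[2], np.array([0.0, 0.0, 1.0]),
                          centers[:2], np.eye(3)[:2], centers, sigma2)
```

Because the memory items enter the same linear system as the new sample, the solved weights keep reproducing the stored input-output pairs, which is how forgetting is suppressed in this scheme.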

Fig. 4. The architecture of RAN-LTM


3  Experimental  Results 

Figure 5 shows a simulation process of the face detection model. The AdaBoost
algorithm based on Haar-like form features alone generates some wrong face detection
results. Fig. 5 (a) shows an example of a wrong face detection, which is caused
by the AdaBoost algorithm considering only Haar-like form features in an intensity
image. In this case, a shirt is wrongly detected as a face since the intensity
distribution in the shirt looks like a face. This problem can be resolved by the
proposed model using face candidate areas as shown in Fig. 5 (b). The shirt is not
selected as a face candidate area by the proposed face-color preferable attention
model, as shown in Fig. 5 (c).


Fig. 5. Comparison of face detection between the AdaBoost algorithm and the proposed
model: (a) face detection result by AdaBoost, (b) face color preferable SM and face
candidate area, (c) face detection result by the proposed model

A main goal of the proposed model is to reduce the time for face detection by
restricting the search regions using the selective attention model before conducting
face detection by AdaBoost. As shown in Table 1, the proposed model can
successfully find human faces within 0.0539-0.2624 sec. The experiments were
conducted on 530 facial images of the UCD database obtained in indoor
environments [15]. In this experiment, we used a computer with a 3.0 GHz
CPU and 2 GB of RAM.

Table 1. The time for face detection, and the performance comparison between the
proposed model and AdaBoost

                                       AdaBoost               Proposed Model
Processing Time [ms]  Saliency Map     None                   35.7 ms - 60.8 ms
                      AdaBoost         199.8 ms - 263.9 ms    7.75 ms - 240.1 ms
                      Total            206.4 ms - 270.8 ms    53.9 ms - 262.4 ms
Performance (%)       True Positive    100%                   100%
                      False Positive   8.4%                   3%

Figure 6 demonstrates how the incremental feature extraction using IPCA is
conducted. Fig. 6 (a) shows an initial set of nine input faces, each of which is given
by a gray-scale image. Fig. 6 (b) shows six eigen-faces (eigenvectors) computed by
applying PCA to the initial set in a batch learning mode. Since an eigen-feature vector
is obtained by projecting each face image to the six eigen-faces in Fig. 6 (b), every
high-dimensional input image in Fig. 6 (a) is reduced to a six-dimensional
eigen-feature vector. Fig. 6 (c) shows three sets of incrementally given data, each of
which consists of two face images. After applying IPCA to these sets of face images,
the number of eigen-faces is increased to 10 (i.e., 10-dimensional eigen-features are
extracted) and the eigen-faces are updated as shown in Fig. 6 (d).
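The batch step described above, computing a few eigen-faces from an initial image set and projecting an image onto them, can be sketched as follows. Random arrays stand in for the nine face images; the 8 x 8 image size and the function names are assumptions for illustration only.

```python
import numpy as np

def fit_eigenfaces(images, n_components):
    """Batch PCA: return the mean image vector and the leading eigen-faces.

    images: (num_images, height, width). The eigen-faces are the right singular
    vectors of the centered data, returned as columns of an (n, n_components) matrix.
    """
    X = images.reshape(len(images), -1).astype(float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def project(image, mean, U):
    """Eigen-feature vector of one image: its coordinates on the eigen-faces."""
    return U.T @ (image.ravel() - mean)

# Usage: nine stand-in images reduced to six-dimensional eigen-features.
rng = np.random.default_rng(0)
faces = rng.random((9, 8, 8))
mean_face, eigenfaces = fit_eigenfaces(faces, n_components=6)
feature = project(faces[0], mean_face, eigenfaces)
```

In the incremental setting, later images would update `mean_face` and `eigenfaces` instead of refitting from scratch, and the feature dimension may grow.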


Fig. 6. A schematic processing flow of the incremental feature extraction by IPCA: (a)
nine gray-scale input images, (b) six eigen-faces (eigenvectors) computed by PCA, (c)
three sets of two face images that are given incrementally, and (d) updated eigen-faces
whose number is increased to 10 by applying IPCA to the three sets of face images.


Table 2. Performance comparison between the two incremental learning models for
personal authentication systems

                     Baseline Model               Proposed Model
                     (IPCA with NN classifier)    (IPCA with RAN-LTM)
# of total images    Prior knowledge face images: 3;
                     incremental learning face images: 197
Success              68                           182
Fail                 132                          18
Performance (%)      34 %                         91 %

Table 2 shows the comparison between a baseline model using IPCA with the
nearest neighbor (NN) classifier and the proposed model using IPCA with RAN-LTM,
in which the number of output units of RAN-LTM is 200. As shown in Table 2,
the proposed model successfully works as an incremental personal authentication
system without serious forgetting.

4  Conclusions 

In this paper, we propose a new approach to constructing an adaptive personal
authentication system, which includes a face selective attention model,
incremental feature extraction by IPCA, and an incremental neural classifier called
RAN-LTM. The face selective attention model not only successfully localizes the
facial areas but also appropriately rejects non-face areas. The proposed model is
based on face color related features in order to generate face color preferable
attention, and the AdaBoost algorithm decides whether the attended region contains a
face. To learn a feature space incrementally, we adopt IPCA, in which the
feature space is updated not only by rotating existing eigen-axes but also by
increasing their number (i.e., the eigenspace dimensions are increased) based on the
accumulation ratio. To adapt to the evolution of the feature space, an extended model
of RAN-LTM is adopted as a classifier, and we used an efficient way to reconstruct
RAN-LTM after updating the feature space. In the experiments, we verify that the
proposed incremental learning scheme works well and that the test performance of the
classifier improves continuously as the incremental learning stages proceed.

As further work, we are planning to develop an embedded system for personal
authentication based on facial biometric information, and we should test the
developed system on larger facial databases. Moreover, we are considering more
experiments to verify the proposed model by comparing its performance with that of
state-of-the-art models.

Abstract. This paper proposes a hybrid method for face recognition
using local features and statistical feature extraction methods. First, a
dense set of local feature points is extracted in order to represent a
facial image. Each local feature point is described by the keypoint
descriptor defined by the SIFT feature. Then, the statistical feature
extraction methods, PCA and LDA, are applied to the set of local feature
descriptors in order to find low dimensional features. With the obtained low
dimensional feature vectors, we can conduct the face recognition task
efficiently using a simple classifier. Through computational experiments on
benchmark data sets, we show that the proposed method is superior
to the conventional PCA and LDA in classification performance. In
addition, we also show that the proposed method can achieve remarkable
improvement in processing time compared to the conventional
keypoint matching methods proposed for local features.

Keywords: Face recognition, Local features, Global statistical features,
SIFT, PCA, LDA.

1  Introduction 

Face recognition has attracted significant attention [1] in recent years because of
its wide applications. One of the most widely used approaches for efficient
representation of facial images is statistical feature extraction such as PCA
(principal component analysis) and LDA (linear discriminant analysis). By analyzing
distributional properties of a set of facial images, these methods can find
low dimensional features which maximize specific statistical criteria. The Eigenface
method [2], which is based on PCA, provides a low-dimensional representation
of facial images that minimizes the loss of information in the sense of squared
error. The Fisherface method [3], which is based on LDA, provides a low-dimensional
representation that maximizes discrepancies among different classes.

Though these methods can give highly effective dimension reduction properties,
there are still difficult problems that should be considered. The statistical
Eigenface and Fisherface methods consider a facial image as a vector point in
a high dimensional input space, and they focus on finding the distributional
structure of the whole data set. Consequently, the conventional statistical methods
fail to keep local features, which are useful for discriminating human faces. In
addition, the facial images to which PCA and LDA have been applied are usually
represented by just gray level intensity. However, the human visual system is known
to use more sophisticated local feature descriptors such as gradient and orientation
of local edges, which may play an important role in recognizing faces.

To overcome these restrictions of conventional statistical methods, we try
to utilize local features for representing facial images. There have been various
studies on developing local feature descriptors which are robust to various image
transformations such as illumination change, rotation and scale. Using these
local feature descriptors, we can expect robustness to local changes of
images such as occlusions. One of the most successful local features for image
data is SIFT (scale invariant feature transform), which was developed by Lowe
[4]. Using feature descriptors defined by the gradient and orientation of local image
patches, Lowe suggested a method for object detection that extracts a set
of keypoints from each image and matches them between two images using some
invariant properties of the local features under typical transformations such
as scale, rotation, and translation [4].

However, in the case of facial recognition, the original SIFT method does not
show satisfactory performance. Due to a lack of texture in facial images, original
SIFT cannot detect a sufficient number of keypoints from a face and thus represents
the whole face using a very limited number of local features. Moreover, facial
images from even a single subject have diverse variations which cannot be explicitly
defined using mathematical relationships such as rotation, scale and translation.
This characteristic may also be a cause of the deteriorated performance of original
SIFT. In order to resolve these problems, a number of variations of original SIFT
have been proposed. The GRID-SIFT method, which was studied by Bicego
and Luo [5], divides facial images into a number of subregions so that keypoint
matching can be done in the corresponding subregions. Since the variation of
facial images does not include the translation of facial parts, this grid makes the
matching process more efficient. However, this is a rudimentary approach and
does not give a substantial solution to the typical variational properties of facial
images. On the other hand, the dense SIFT method has been developed, which
constructs a dense set of keypoints for an image by extracting local features from
fixed locations of each image [6][7]. Though dense SIFT can resolve the problem
of the lack of keypoints, it is very costly in the matching process because of the
extremely large number of keypoints with high dimensional descriptors.

In this paper, we propose a combination of the statistical feature extraction
method and the local keypoint descriptors in order to compensate for their weak
points and to improve recognition performance. Based on the dense SIFT
method, we represent a facial image using a dense set of local keypoints. Then we
apply PCA and LDA to the high dimensional vector composed of the dense set
of keypoints, so as to get a low dimensional feature vector which is statistically
meaningful and efficient in matching calculation. By using local features instead
of gray-level intensity when applying PCA and LDA, we expect to get more
useful information for face representation. Also, by applying statistical feature
extraction to the set of local keypoints, we expect to get more efficient low
dimensional features which can learn the statistical variations of facial images.

In the next section, the conventional studies on local features (SIFT and
dense SIFT) for face recognition are briefly reviewed. In Section 3, the proposed
method for combining the local feature approach and statistical feature extraction
is described. Experimental results on a benchmark facial data set are given
in Section 4, and conclusions are drawn in Section 5.

2  Local  Feature  Extraction  For  Face  Recognition 

In this section, we describe the conventional local features, SIFT and its
modifications for face recognition. There are three issues when we use local features for
face recognition. First, we need to determine how to select interesting points (i.e.,
keypoints) from an image. Second, we need to define an appropriate descriptor
for the selected keypoints so that it can represent robust local properties of given
images. Third, after every image is represented by a set of keypoint descriptors, we
need to measure the similarity between two images. In the local approaches, the
similarity is measured by matching each keypoint in one image with one in
the other image. Once the similarity is measured, we can conduct the classification
process using simple classifiers such as K-nearest neighbor. In this section, we
briefly explain these three issues for the SIFT method and its variations.

2.1  Keypoints  Selection 

SIFT [4] uses the scale-space Difference-of-Gaussian (DOG) to detect keypoints
in images. For an input image $I(x, y)$, the scale space is defined as a function
$L(x, y, \sigma)$ produced by the convolution of a variable-scale Gaussian
$G(x, y, \sigma)$ with the input image. The DOG function is defined as follows:

$$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \qquad (1)$$

where $k$ represents a multiplicative factor.

The local maxima and minima of $D(x, y, \sigma)$ are found by comparing each point with its eight
neighbors in the current image and its nine neighbors in each of the scales above and below.
From the obtained local maxima and minima, keypoints are selected based on
the measures of their stability and the value of keypoint descriptors which will
be described below.
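
As a minimal sketch of the neighbor comparison described above, the following function checks whether a DOG sample is a strict maximum or minimum among its 26 neighbors (8 in the same layer, 9 in each adjacent layer). The function name and the list-of-rows layout are illustrative assumptions, not part of the original method description; the stability tests that follow extremum detection are omitted.

```python
def is_scale_space_extremum(below, current, above, r, c):
    """Return True if current[r][c] is a strict max or min among its 26 neighbors.

    `below`, `current`, `above` are three adjacent DOG layers given as lists of rows.
    (r, c) must be an interior position so that all neighbors exist.
    """
    center = current[r][c]
    neighbors = []
    for k, layer in enumerate((below, current, above)):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if k == 1 and dr == 0 and dc == 0:
                    continue  # skip the candidate point itself
                neighbors.append(layer[r + dr][c + dc])
    # 9 + 8 + 9 = 26 neighbors are compared against the candidate
    return center > max(neighbors) or center < min(neighbors)
```

A flat region yields no extremum under this strict comparison, which is one reason low-texture images such as faces produce few candidates.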

In face recognition, a main drawback of the original SIFT-based keypoint
selection is that only a small number of keypoints are extracted, due to the lack of
texture in facial images, which may cause low performance in face recognition.
Thus, instead of the original keypoint selection method proposed by Lowe [4],
local feature descriptors are extracted at regular image grid points so as to give
a dense description of the image content. This modification is usually called
dense SIFT [8][9]. Dense SIFT was first developed by Dalal and Triggs [6]
for pedestrian detection. Dreuw [7] proposed using dense SIFT features
for face recognition with a grid matching strategy. We also use this approach
in the proposed combining method.
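
The regular sampling used by dense SIFT can be illustrated with a short sketch. The function name, the `step` default, and especially the `margin` value are assumptions chosen for illustration; the text does not state the border handling actually used.

```python
def dense_grid(height, width, step=2, margin=4):
    """Return (row, col) sampling locations on a regular grid.

    `margin` keeps descriptor patches inside the image. Its value is an
    assumption for this sketch, not a setting taken from the paper.
    """
    rows = range(margin, height - margin, step)
    cols = range(margin, width - margin, step)
    return [(r, c) for r in rows for c in cols]


points = dense_grid(88, 64)
print(len(points))  # 40 rows x 28 columns = 1120 locations
```

With these assumed settings, an 88 by 64 image sampled at every second pixel gives 1120 locations, which happens to coincide with the number of keypoints reported in the experimental tables. Unlike detector-based selection, the count depends only on the image size, so every face is described by the same number of keypoints at the same positions.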


2.2  Keypoint  Descriptor 

Each keypoint extracted by SIFT is represented by a descriptor, a 128
dimensional vector computed as a set of orientation histograms in the
neighborhood of the keypoint location. The neighborhood is divided into 4 by 4
subregions, and each subregion has an orientation histogram with 8 main
directions that accumulates the gradient magnitude at each point, which gives
4 x 4 x 8 = 128 values. The gradient magnitude $m(x, y)$ and orientation
$\theta(x, y)$ are computed on the Gaussian smoothed image $L$ whose scale
$\sigma$ is closest to the keypoint scale. Explicitly, the gradient magnitude
$m(x, y)$ and the orientation $\theta(x, y)$ at point $(x, y)$ are given by

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}, \qquad (2)$$

$$\theta(x, y) = \tan^{-1}\left[\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right]. \qquad (3)$$

In the original SIFT for object recognition, the gradients are aligned to the main
direction to obtain rotation invariant descriptors.
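
The two gradient formulas can be written directly as central differences. This is a minimal sketch: the function name and the `L[y][x]` indexing convention are assumptions, and `atan2` is used in place of a plain arctangent so that a zero horizontal difference does not cause a division by zero.

```python
import math


def gradient(L, x, y):
    """Gradient magnitude and orientation at (x, y) by central differences.

    `L` is a smoothed image indexed as L[y][x]; (x, y) must be an interior pixel.
    """
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    magnitude = math.sqrt(dx * dx + dy * dy)
    # atan2 agrees with arctan(dy / dx) when dx > 0 and also resolves the quadrant
    orientation = math.atan2(dy, dx)
    return magnitude, orientation
```

In a full descriptor, each orientation would be quantized into one of 8 bins and the magnitude added to that bin of the histogram for the subregion containing the pixel.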

In order to apply SIFT to face recognition, some modifications of the descriptors
have been made [7]. The main idea of the modification is that, if the face detector can
provide a rotation-free image, the descriptors no longer need to be rotation
invariant. Moreover, rotation invariant descriptors may even lead to false
matching correspondences. Under this consideration, Dreuw [7] proposed using the
upright version of the SIFT descriptor for face recognition, in which the gradients
of the descriptor are aligned to a fixed direction. The upright versions are faster to
compute and can increase accuracy.


2.3  Keypoints  Matching  and  Classification

In order to classify an image, we need to measure the similarity between
two images. To measure the similarity using sets of keypoints, we first have
to match each keypoint in one image to one in the other image. When we use
the original SIFT method for face recognition, all possible pairs of keypoints are
traversed to select a set of matching pairs with sufficiently similar descriptors. The
similarity of two images is then calculated as the number of selected matching
pairs. However, in the case of facial images, we cannot expect satisfactory performance
from this measure with just local matching. Sometimes, a pair of keypoints
from obviously different facial areas (for example, one from the left eye and the other
from the upper lip) is selected as a matching pair. Since the number of keypoints in
a facial image is quite small, as mentioned before, these mismatched pairs
often result in wrong classification. To avoid this, the GRID-SIFT method
divides an image into a number of subregions and allows matching only when two
keypoints are from the same subregion, which leads to slightly better performance.
In the case of dense SIFT, the same matching method can be applied. Since
each keypoint is obtained from a fixed location, the obvious mismatching of
keypoints in different locations can be avoided to some extent. However, traversing
all possible pairs of keypoints is a very time consuming process. Though the GRID
approach can also be applied to speed up the matching process [7], it still has a
high computational cost compared to statistical approaches.
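
The difference between exhaustive matching and location-restricted matching can be made concrete with a small sketch. The distance threshold used as the "sufficiently similar" test and all function names are assumptions for illustration; the original SIFT matching criterion is not reproduced here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def similarity_all_pairs(desc_a, desc_b, threshold):
    """Count keypoints of A whose closest descriptor anywhere in B is within threshold.

    Every pair is examined, so the cost grows with len(desc_a) * len(desc_b).
    """
    count = 0
    for a in desc_a:
        if min(sq_dist(a, b) for b in desc_b) <= threshold:
            count += 1
    return count


def similarity_one_to_one(desc_a, desc_b, threshold):
    """Compare only descriptors taken at the same grid location (same list index)."""
    return sum(1 for a, b in zip(desc_a, desc_b) if sq_dist(a, b) <= threshold)
```

For example, with `desc_a = [[0, 0], [1, 1], [5, 5]]` and `desc_b = [[5, 5], [0, 0], [9, 9]]`, exhaustive matching accepts pairs that come from different locations, while one-to-one matching rejects them. Grid matching sits between the two by restricting the search to descriptors from the same subregion.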

3  Combination  of  Local  and  Statistical  Feature 
Extraction 

In order to solve the problem of the high computational cost of dense SIFT and
to utilize the statistical information of the training data set, we combine the local
features with statistical feature extraction methods. In this paper, we exploit
two well known statistical methods: PCA and LDA.

3.1  Statistical  Feature  Extraction 

PCA tries to find a subspace whose basis vectors correspond to the maximum-variance
directions in the original space, so as to minimize the information loss
caused by dimension reduction in the sense of squared error. Let $W$ represent the
transformation matrix that provides an optimal linear transformation from the
original space onto a subspace [10]. The new feature vectors $y_i$ are defined as
follows:

$$y_i = W^T x_i, \qquad (4)$$

where $i = 1, \ldots, N$ and $N$ is the number of data.

The columns of $W$ are the eigenvectors $e_i$ obtained by solving the eigenvalue
decomposition

$$\lambda_i e_i = \Sigma e_i, \qquad (5)$$

where $\Sigma$ is the covariance matrix of the training data and $\lambda_i$ is the
eigenvalue associated with eigenvector $e_i$.

While PCA is an unsupervised method, LDA utilizes class information to give
maximum class discrepancy. LDA tries to find a subspace in which the ratio of
the between-class scatter $S_b$ to the within-class scatter $S_w$ is maximized. When
the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$ are given by

$$S_w = \sum_{j=1}^{c} \sum_{i=1}^{N_j} (x_i^j - \mu_j)(x_i^j - \mu_j)^T \qquad (6)$$

and

$$S_b = \sum_{j=1}^{c} (\mu_j - \mu)(\mu_j - \mu)^T, \qquad (7)$$

the columns of $W$ can be obtained as the eigenvectors of $S_w^{-1} S_b$. Here, $x_i^j$ is the
$i$th sample of class $j$, $\mu_j$ is the mean of class $j$, $\mu$ is the mean of all classes,
$c$ is the number of classes, and $N_j$ is the number of samples in class $j$.

These methods can find statistically meaningful low dimensional features
through learning from a given data set, which cannot be obtained by local
approaches with no learning process. However, they are basically global approaches,
which treat an image as a vector in the input space, and the obtained features
mainly represent global shapes of faces.
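
A minimal NumPy sketch of the LDA computation is given below, under two assumptions that go beyond the text: the overall mean is taken as the mean of the class means, and a pseudo-inverse is used so that a singular within-class scatter matrix does not raise an error. The function names are illustrative.

```python
import numpy as np


def scatter_matrices(X, labels):
    """Return (S_w, S_b) for samples stored in the rows of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    class_means = np.array([X[labels == j].mean(axis=0) for j in classes])
    mu = class_means.mean(axis=0)  # assumption: mean of the class means
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for j, mu_j in zip(classes, class_means):
        centered = X[labels == j] - mu_j
        S_w += centered.T @ centered          # within-class scatter
        diff = (mu_j - mu)[:, None]
        S_b += diff @ diff.T                  # between-class scatter
    return S_w, S_b


def lda_directions(X, labels, n_components):
    """Columns are the leading eigenvectors of pinv(S_w) @ S_b."""
    S_w, S_b = scatter_matrices(X, labels)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_components]].real
```

Since the between-class scatter is a sum of one rank-one term per class, at most c - 1 directions carry discriminative information, which is why the LDA feature dimension is bounded by the number of classes.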

3.2  Proposed  Combination  Strategy 

In this paper, we combine two different approaches to face recognition:
local feature matching and global statistical feature extraction. First, we
represent an image using local features. By using local features, we can obtain
more abundant information from an image than by using simple gray-level
intensity. Then we apply a statistical feature extraction method to the set of image
data represented by local features. Through statistical analysis of the data set, we
can expect to obtain low dimensional features which efficiently represent the
diverse variations in the given training images.

Figure 1 shows an illustrative comparison between the conventional local
approach and the proposed method. In the local approach, the whole dense
set of keypoints is directly used for measuring the similarity between a test image
and the training images. The computational cost of recognizing
a test image therefore increases with the number of training data as well as with
the number of keypoints. In the proposed method, we apply PCA
and LDA to extract low dimensional features from the high dimensional local
features. Though we need an additional learning process to find the
transformation matrix W, the cost of recognizing a test image is much lower than
that of the local approach. In addition, once PCA has been done for the training data
set, we do not need to keep the dense set of keypoints, which would require large
storage resources. Besides saving computational resources, we can
also expect to get statistically meaningful features representing diverse variations
through learning from the training images.
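
The combination described above can be sketched end to end with NumPy. This is an illustration under stated assumptions, not the authors' implementation: random arrays stand in for dense SIFT descriptors, PCA is computed through an SVD of the mean-centered data, and all function names and toy sizes are hypothetical. Centering only shifts every projected feature by the same constant, so it does not change nearest neighbor distances.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return (mean, W); the columns of W are the leading principal directions."""
    mean = X.mean(axis=0)
    # Rows of Vt are eigenvectors of the covariance matrix of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def project(X, mean, W):
    """Map dm-dimensional vectors to low dimensional features."""
    return (X - mean) @ W


def nearest_neighbor(t, Y, labels):
    """Label of the training feature closest to t (Euclidean distance)."""
    return labels[int(np.argmin(np.linalg.norm(Y - t, axis=1)))]


# Toy stand-ins for dense descriptors: N images, m keypoints, d dimensions each.
rng = np.random.default_rng(0)
N, m, d = 6, 5, 8
labels = np.array([0, 0, 0, 1, 1, 1])
D = 0.1 * rng.normal(size=(N, d, m)) + labels[:, None, None]  # class 1 is offset by 1

X = D.reshape(N, d * m)                # vectorize each d x m descriptor matrix
mean, W = fit_pca(X, n_components=2)   # learn the transformation from training data
Y = project(X, mean, W)                # low dimensional training features

D_test = 0.1 * rng.normal(size=(d, m)) + 1.0   # unseen image resembling class 1
t = project(D_test.reshape(1, d * m), mean, W)[0]
predicted = nearest_neighbor(t, Y, labels)
```

After training, only `mean`, `W`, and the low dimensional `Y` need to be stored, and a test image is compared against N short vectors rather than against N full sets of keypoint descriptors.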

The detailed steps of the proposed method are as follows.

1. Let the training face images be $I_1, I_2, \ldots, I_N$, where $N$ is the number of training
images.

2. Apply dense feature extraction $f(\cdot)$ to each image to obtain the matrix of
descriptors:

$$D_i = f(I_i), \qquad (8)$$

where $i = 1, 2, \ldots, N$, $D_i$ is a $d \times m$ matrix, $d$ is the dimensionality of each
descriptor, and $m$ is the number of keypoints.

3. By vectorizing each matrix $D_i$ ($i = 1, \ldots, N$), obtain a set of $dm$-dimensional
vectors, $X = \{x_1, \ldots, x_N\}$.

4. Apply PCA or LDA to $X$ and get the linear transformation matrix $W$.

5. Transform each $x_i$ using $W$ to get low dimensional features $y_i$ ($i = 1, 2, \ldots, N$).

6. For a given test image, obtain the low dimensional feature $t$ through steps 2, 3,
and 5.

7. Classify the test image via a simple nearest neighbor algorithm.

[Figure 1: (a) conventional local approach; (b) proposed approach. Recoverable labels: train images, whole train dense descriptors, form a matrix, nearest neighbor, classified.]

Fig. 1. An illustrative comparison between conventional local approach and proposed
method

4  Experimental  Results

4.1  AR  Database 

In this section, we verify the efficiency of the proposed method through
experiments on the AR database [11], and provide comparisons with the conventional local
approaches and the conventional statistical methods. The AR database consists
of over 3,200 color images of frontal faces from 126 individuals: 70 men and 56
women. There are 26 different images for each person. For each subject, these
were recorded in two different sessions separated by a two-week delay. Each
session consists of 13 images which differ in facial expression, illumination,
and partial occlusion. In this experiment, we used images manually aligned
[10] by the location of the eyes. After localization, faces were morphed so as
to fit a grid of size 85 by 60. Finally, images were resized to 88 by 64 pixels. A
set of examples from one subject is shown in Fig. 2. The first and second rows
show images taken at the first session, and the remaining images were taken at the
second session.


[Figure 2: sample face images, panels (a) to (z).]

Fig. 2. Sample images for one subject of AR database


4.2  Experimental  Conditions  and  Results 

Using the AR database, we compared the classification performance of the proposed
method with a number of conventional methods: PCA, LDA, SIFT, and dense
SIFT with variations in the matching method. We used the open source
implementation of SIFT and dense SIFT by Vedaldi and Fulkerson
[12]. For SIFT, we applied the original matching method proposed by Lowe [4],
which was briefly described in Section 2.3. For dense SIFT (DSIFT), features are
extracted at every second pixel in the row and column directions of each
image. For the keypoint matching strategy, we tried three variations. The basic DSIFT
uses the same matching strategy as the original SIFT. Since the
original matching strategy tries to match all possible pairs of keypoints between two images,
the computational cost becomes very high, especially in the case of dense SIFT.
DSIFT 1-to-1 matches a keypoint in an image to the one at the same
location in the other image. Since DSIFT 1-to-1 matching has only one matching
candidate per keypoint, the computational cost is much less than that of
the original DSIFT. DSIFT GRID matches keypoints in the same
sub-region. DSIFT GRID, which was proposed by Dreuw [7], can be
considered a compromise between the above two methods. For PCA,
we take the eigenvectors so that the loss of information is less than 1%, and
discard the first four eigenvectors, as is usually done when applying PCA to face
recognition. For LDA, we use the feature set obtained through PCA to avoid the
small sample size problem. After applying LDA, we use the maximum dimension of
the feature vector, which is limited by the number of classes. For DSIFT PCA and
DSIFT LDA, the same strategies as for PCA and LDA are taken. As comparison
criteria, we used the misclassification rate as well as the processing time. To
show the relative time complexity among the methods, we report the
ratio of the processing time of each method to that of the PCA method (see
Tables 1 and 2).
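
The eigenvector selection rule for PCA can be sketched as follows. The sketch assumes that "loss of information" means the fraction of total variance (sum of eigenvalues) carried by the discarded trailing components; this reading, the function name, and the parameter names are assumptions rather than details given in the text.

```python
def select_components(eigenvalues, max_loss=0.01, n_discard=4):
    """Return the indices of the eigenvectors to keep.

    `eigenvalues` must be sorted in decreasing order. Leading components are
    added until the variance left out drops below `max_loss`; the first
    `n_discard` components are then removed from the selection.
    """
    total = sum(eigenvalues)
    kept, cumulative = 0, 0.0
    while kept < len(eigenvalues) and (total - cumulative) / total >= max_loss:
        cumulative += eigenvalues[kept]
        kept += 1
    return list(range(n_discard, kept))
```

For the eigenvalues `[50, 30, 10, 5, 3, 1.5, 0.4, 0.1]`, six components are needed to bring the loss below 1%, and dropping the first four leaves indices 4 and 5. The leading eigenvectors are commonly discarded in face recognition because they tend to capture illumination rather than identity.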

In the first experiment, we used only non-occluded images with expression and
illumination variations. For 100 individuals, seven non-occluded images taken at
the first session (i.e., Fig. 2 (a)~(g)) were used for training, and the
seven non-occluded images from the second session (i.e., Fig. 2 (n)~(t)) were
used for testing. The results of these experiments are listed in Table 1. Compared
to the conventional PCA and LDA, the proposed method (DSIFT
PCA and DSIFT LDA) achieves a remarkable improvement in error rates. The
original SIFT shows the worst result, as expected. Though DSIFT shows the
best performance, its processing time for a single test is about 3600 times
longer than that of the proposed method. The DSIFT GRID method is faster than
DSIFT, but still much slower than the proposed method. Compared to the DSIFT
1-to-1 method, which has a much shorter processing time than DSIFT GRID, the
proposed method provides superior results. From this, we can
say that the proposed method achieves robustness to the variations in the
training images to some extent.
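
The speed comparison can be checked with simple arithmetic on the single-test time ratios reported in Table 1. The dictionary below copies those values; the helper name is illustrative.

```python
# Single-test time ratios (relative to PCA) copied from Table 1.
single_test = {
    "DSIFT": 532914.60,
    "DSIFT GRID": 31317.92,
    "DSIFT 1-to-1": 6473.95,
    "DSIFT PCA": 148.23,
}


def slowdown(method, reference="DSIFT PCA"):
    """How many times slower `method` is than `reference` for a single test."""
    return single_test[method] / single_test[reference]


for name in ("DSIFT", "DSIFT GRID", "DSIFT 1-to-1"):
    print(f"{name}: {slowdown(name):.0f}x slower than DSIFT PCA")
```

The first ratio is just under 3600, consistent with the figure quoted in the text; DSIFT GRID and DSIFT 1-to-1 are roughly 211 and 44 times slower than DSIFT PCA, respectively.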

In the second experiment, we compared the performance on the occluded
images. For 100 individuals, three non-occluded images taken at the first session
(i.e., Fig. 2 (a), (c), and (g)) were used for training, and the four remaining
non-occluded images and six occluded images from the first session (i.e., Fig. 2 (b),
(d), (e), (f), (h)~(m)) were used for testing. The results of these experiments
are listed in Table 2. We can see a larger deterioration in the performance of
PCA and LDA compared to the first experiment. This may be because the
global nature of the statistical methods is not suited to images with
occlusion. Nevertheless, the proposed combination method achieves a remarkable
improvement by utilizing local features. As in the first experiment, DSIFT
shows the best classification rates, but its processing time for even a single test is
still very long. From these results, we can say that the proposed method is a
reasonable compromise between classification rate and processing time.


Table 1. Result of face recognition on AR database with time delayed variation

strategy (options)   number of features    time (relative ratio)           Error Rate (%)
                                           single test   learning + test
PCA                  219                          1.00              1.00            23.00
LDA                  99                           0.90              1.01            15.86
SIFT                 depending on image         506.87             14.09            24.29
DSIFT                1120 x 128              532914.60          19290.77             0.14
DSIFT 1-to-1         1120 x 128                6473.95            239.88             9.29
DSIFT GRID [7]       1120 x 128               31317.92           1137.01             0.29
DSIFT PCA            568                        148.23             14.32             2.14
DSIFT LDA            99                         148.13             14.98             0.43

Table 2. Result of face recognition on AR database with occlusion

strategy (options)   number of features    time (relative ratio)           Error Rate (%)
                                           single test   learning + test
PCA                  133                          1.00              1.00            57.10
LDA                  99                           1.07              1.14            56.80
SIFT                 depending on image         256.31             70.02            56.80
DSIFT                1120 x 128              255900.47          67333.72             0.00
DSIFT 1-to-1         1120 x 128                7708.85           2040.26            29.20
DSIFT GRID [7]       1120 x 128               20195.04           5325.11             0.00
DSIFT PCA            252                        223.31             77.24             5.00
DSIFT LDA            99                         223.26             77.62             3.90

5  Conclusions 

In this paper, we proposed a hybrid approach that combines local features and
statistical features. By using local features, we can get a robust representation of
image data. By applying statistical feature extraction to the dense set of local
features, we can find efficient low dimensional feature vectors. Since the
utilization of local features and learning from data are two main abilities of human
beings, which play essential roles when humans recognize objects, the
proposed hybrid approach can be considered a preliminary step toward realizing
machines with more human-like visual pattern recognition ability. Through
computational experiments, we showed that the proposed method may be a
reasonable compromise that keeps the advantages of both the local and statistical
features. This is a preliminary approach to combining local feature and global
statistical approaches, and other sophisticated methods for extracting statistical
features could be applied to further improve classification performance.


Abstract. Many learning techniques for Bayesian networks have been developed
for adaptation to a user or environment. However, several drawbacks
still exist in the conventional learning approach: the difficulty of collecting log
data, the inherent ambiguity in recognizing and reflecting the user's implicit
intention, and difficulties in extracting relations between data or definite rules. In
this paper, we propose a method for parameter learning in Bayesian networks
using semantic constraints of conversational feedback to overcome these
limitations. Production rules extracted from users' conversational feedback are used
in parameter learning of the Bayesian network. A comparison test with
conventional approaches is conducted to verify the usefulness of the proposed method.

Keywords:  Bayesian  network,  parameter  learning,  conversation,  semantic 
constraints. 


1  Introduction 

A Bayesian network (BN) is a graphical model that represents probabilistic relationships
among a set of variables. In the DAG (directed acyclic
graph) of a BN, the nodes represent variables and the directed arcs represent the relationships between variables. Over
the last decade, BN has become a popular representation for encoding uncertain
expert knowledge in expert systems [1].

Two methods are conventionally applied to determine the parameters of a BN: the
use of expert knowledge and learning from data. Determining the parameters by
the use of expert knowledge has the advantage of reflecting experts' preferences, but
the process is difficult and time-consuming. Furthermore, it is unclear whether the
network designed by the experts is really the most appropriate model for the problem
at hand. Therefore, there are studies which statistically learn parameters from training
data [2-4]. If there are enough training data, we can estimate proper probabilities for the
conditional probability table (CPT) using machine learning and statistical methods
such as maximum likelihood estimation (MLE), sequential learning [5], the EM
algorithm [6], Gibbs sampling [7], and importance sampling [8].

However, the available data samples are often not enough when putting the
learning theories into practice [9], and data distributions may change over time
according to changes in the environment or users' preferences. Moreover, data samples
from the real world may have missing values and other uncertainties. The
conventional machine learning methods take much time to learn models from such data
samples. Furthermore, it is hard for them to explicitly consider the user's intention, and
it is difficult to extract symbolic relations between data or definite production rules.

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 467-476, 2010.
© Springer-Verlag Berlin Heidelberg 2010

S.-H. Lee, S. Lim, and S.-B. Cho

In this paper, we propose a direct parameter learning method for Bayesian networks
based on semantic constraints extracted from conversation with users. Through
conversation, the proposed method gets the user's feedback and generates production rules in
a symbolic representation. These rules are used to update parameters in the Bayesian
network in order to directly reflect the user's intention. Compared to the
conventional data-driven learning methods, it can easily and instantly adapt to a new
environment and a new user without a long period of observation. Furthermore, this approach
gives the user a chance to actively modify or develop their own probability network simply
by speaking, without prior knowledge of BN and without the help of BN experts.


2  Related  Works 

2.1  Bayesian  Networks 

A Bayesian network has the shape of a DAG (directed acyclic graph) expressing the
relations of nodes, and describes a large set of probabilistic relations with CPTs (conditional
probability tables) constrained by the structure, as in Fig. 1.


[Figure 1: Bayesian network with conditional probabilities, as far as recoverable:
P(Pollution) = 0.1
P(Cancer | Pollution ∧ Smoker) = 0.05
P(Cancer | Pollution ∧ ~Smoker) = 0.02
P(Cancer | ~Pollution ∧ Smoker) = 0.03
P(Cancer | ~Pollution ∧ ~Smoker) = 0.001
P(Xray | Cancer) = 0.9
P(Xray | ~Cancer) = 0.2
P(Dyspnoea | Cancer) = 0.65
P(Dyspnoea | ~Cancer) = 0.30]

Fig. 1. An example of Bayesian network. The conditional probabilities are defined in the box.

The posterior belief of unobserved nodes, $Bel(h)$, is calculated by applying Bayes' rule
as follows:

$$Bel(h) = P(h \mid E) = \frac{P(E \mid h) P(h)}{P(E)} = \frac{P(h \wedge E)}{P(E)} \qquad (1)$$


where $h$ is the hypothesis of a node state and $E$ represents the given evidence set. The
joint probability distribution is computed by the chain rule as

$$P(X_V) = P(X_1, \ldots, X_n) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) \cdots = \prod_{v \in V} P(X_v \mid X_{pa(v)}) \qquad (2)$$

where $pa(v)$ denotes the set of parent variables of variable $X_v$ for each node $v \in V$.
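
As a worked illustration of the chain rule factorization and of Bayes' rule, the sketch below evaluates the network of Fig. 1 by enumeration. The conditional probabilities are those legible in the figure; the prior for Smoker is not legible there, so the value 0.3 is an assumption used only to make the example computable. All names are illustrative.

```python
from itertools import product

P_POLLUTION = 0.1   # as read from Fig. 1
P_SMOKER = 0.3      # ASSUMPTION: not legible in Fig. 1, illustrative value only
P_CANCER = {        # keyed by (pollution, smoker)
    (True, True): 0.05, (True, False): 0.02,
    (False, True): 0.03, (False, False): 0.001,
}
P_XRAY = {True: 0.9, False: 0.2}        # keyed by cancer
P_DYSPNOEA = {True: 0.65, False: 0.30}  # keyed by cancer


def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(pollution, smoker, cancer, xray, dyspnoea):
    """Joint probability as the product of each node given its parents."""
    return (bern(P_POLLUTION, pollution) * bern(P_SMOKER, smoker)
            * bern(P_CANCER[(pollution, smoker)], cancer)
            * bern(P_XRAY[cancer], xray)
            * bern(P_DYSPNOEA[cancer], dyspnoea))


def posterior_cancer_given_xray():
    """Bel(Cancer) = P(Cancer and Xray) / P(Xray), summing the joint over the rest."""
    num = den = 0.0
    for p, s, c, d in product([True, False], repeat=4):
        pr = joint(p, s, c, True, d)
        den += pr
        if c:
            num += pr
    return num / den
```

Each node contributes one factor that depends only on its parents, so the full joint over five binary variables is specified by ten numbers instead of 31. Under the assumed smoker prior, a positive X-ray raises the belief in cancer from about 1.2% to about 5%.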

2.2  Conversational  Interface 

We utilize the conversational agent described in [10] to enable conversational
interaction with users for the extraction of semantic rules. The conversational agent is
composed of two parts: a topic inference module using a probabilistic network, and a
response selection module using keyword matching.

Topic Inference module: In this module, the user's overall intention implied in the conversation is
inferred using a Bayesian approach. The context of possible conversations
is modeled using a BN, which enables effective representation of the relation between a
recognized token and its corresponding context. These relations are hierarchically
captured in three levels: keyword, concept, and topic layers. The keyword layer consists
of words related to topics in the domain. The concept level is composed of the entities
or attributes of the domain, while the topic level represents entities whose attributes
are defined.

Response Selection module: In this module, a proper pattern-response pair is selected by applying a keyword
matching technique according to the recognized current context. Keyword
matching is a procedure of searching the knowledge base associated with the
topic of conversation. When there are many scripts, the performance of keyword
matching declines because of the time required to traverse a massive information space. The
conversational agent divides its knowledge base, the scripts, into several concepts so that it
keeps the scalability and portability that are important for flexible reaction to
various situations. This significantly reduces the number of scripts to be compared.
Each script is stored in XML format. A set of candidate scripts is sequentially
matched to find an appropriate response. The matching scores are calculated by the
F-measure, which is a popular measure in text classification. When there is a
corresponding pattern-response pair, language generation is used to generate the answer.
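
The F-measure scoring of candidate scripts can be sketched as follows. How precision and recall are defined for a script is not stated in the text; here precision is the share of a script's keywords found in the user input and recall is the share of input keywords found in the script, which is an assumption. The function names and the example keywords are hypothetical.

```python
def f_measure(query_keywords, script_keywords):
    """Harmonic mean of precision and recall between two keyword sets."""
    query, script = set(query_keywords), set(script_keywords)
    overlap = len(query & script)
    if overlap == 0:
        return 0.0
    precision = overlap / len(script)
    recall = overlap / len(query)
    return 2 * precision * recall / (precision + recall)


def best_script(query_keywords, scripts):
    """Return the name of the candidate script with the highest F-measure."""
    return max(scripts, key=lambda name: f_measure(query_keywords, scripts[name]))
```

For instance, with the input keywords `["light", "on"]` and candidates `{"lamp": ["light", "on", "off"], "tv": ["tv", "on"]}`, the "lamp" script scores 0.8 and the "tv" script 0.5, so "lamp" is selected. Restricting the candidates to the scripts of the inferred concept keeps this loop short.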

3  Direct  Parameter  Learning  Method  of  Bayesian  Network 

In this section, we describe how BN parameters can be directly learned from
interaction with the user. The description is divided into two parts: extracting production rules
from conversation, and converting rules into BN parameters. The following sections
explain the detailed mechanism for achieving conversation-based learning.


3.1  Extract  Production  Rules  from  Conversation 

A user states conditions and desirable results to the conversational agent when abnormal
or unwanted services are provided, or whenever the user wants to teach the system.
The conversational agent analyzes the user's feedback and extracts semantic information for
generating production rules from the conversation. However, it is not simple to
handle semantic information in the form of natural language, which may contain
complex meaning. In this paper, a symbol-based representation is adopted to manipulate
and maintain information, and we design a language model, defined as a BNF
grammar, to produce formal descriptions of symbols in any domain, as shown in
Table 1.


Table  1.  BNF  description  of  the  proposed  language  framework 


Non-terminal                         Predicates

<Production-rule-description>   ::=  IF <Pattern> THEN <Response>
<Pattern>                       ::=  <Symbol-sequence>+
<Symbol-sequence>               ::=  <Single-symbol> | not <Symbol-sequence>
                                  |  (<Sequential-symbols>)
                                  |  (<Simultaneous-symbols>)
                                  |  (<Domain-specific-symbols>)
<Sequential-symbols>            ::=  <Single-symbol> then <Symbol-sequence>
<Simultaneous-symbols>          ::=  <Single-symbol> and <Symbol-sequence>
                                  |  <Single-symbol> or <Symbol-sequence>
<Domain-specific-symbols>       ::=  <Single-symbol> <Domain-specific-operator>
                                     <Symbol-sequence>
<Single-symbol>                 ::=  <Value> | null
<Value>                         ::=  Symbol-name | Domain-specific-characteristic
<Domain-specific-operator>      ::=  Domain-specific-operator-name
<Response>                      ::=  <Single-symbol>+


The language model involves symbols and inferential rules that model the
relations between symbols, and it associates reasoning with the manipulation of
the symbolic descriptions. A symbol has its own value, while an inferential rule
is composed of the input patterns of symbols and the output symbol for
replacement. The language basically describes the occurrence of symbols by means
of concurrent and sequential relations [11].

A production rule is a sequence of one or more symbol sequences. A symbol
sequence consists of sequential and simultaneous symbols. Sequential symbols
occur one after the other in the order indicated by the sequence, while
simultaneous symbols occur in parallel. The single symbol clause contains the
basic information associated with the symbol extracted from conversation in a
relevant domain. It consists of a unique symbol name and its various
characteristics with respect to the application, such as duration and intensity.
In the language framework, the 'not' tag indicates that the symbol sequence
should not be included in the input. In particular, for generality, the language
framework allows domain-specific relations between symbols to be defined. New
relationships such as relational reasoning and arithmetic operations can be
defined according to the domain.

Since the conversational information is semantically captured in the form of
production rules, a way to train the system using natural language is required.
In our method, symbols are mapped to the corresponding words, while operators
such as 'and' and 'or' are modeled with predefined templates. A query Q from the
user is tokenized into a sequence of words $W = \{w_1, w_2, \ldots\}$ by lexical
analysis. The pattern of W is analyzed by matching it against predefined
templates, which are designed based on the language. We implement several
functions that each produce a part of a production rule, as follows.

F_rule($symbol1, $symbol2)      ->  "IF $symbol1 THEN $symbol2"

F_then($symbol1, $symbol2)      ->  "$symbol1 then $symbol2"

F_and($symbol1, $symbol2)       ->  "$symbol1 and $symbol2"

F_or($symbol1, $symbol2)        ->  "$symbol1 or $symbol2"

F_not($symbol1)                 ->  "not $symbol1"

F_specific($symbol1, $symbol2)  ->  "$symbol1 Domain-specific-operator $symbol2"

A  number  of  templates  are  designed  to  implement  a  flexible  dialogue  for  learning 
knowledge.  Table  2  shows  some  examples  of  templates  for  knowledge  learning.  We 
can  learn  a  new  symbol  based  on  these  templates  and  find  out  the  relations  between 
each  symbol. 

Table 2. Templates defined for knowledge learning from conversation


Template 1   IF    $symbol1 'is' $symbol2 'and' $symbol3
             THEN  F_rule(F_and($symbol2, $symbol3), $symbol1)
             =>    IF ($symbol2 and $symbol3) THEN $symbol1

Template 2   IF    $symbol1 'is a sequence of' $symbol2 'and' $symbol3
             THEN  F_rule(F_then($symbol2, $symbol3), $symbol1)
             =>    IF ($symbol2 then $symbol3) THEN $symbol1

Template 3   IF    'if' $symbol2 'is occurred after' $symbol3 'then' $symbol1
                   'is activated'
             THEN  F_rule(F_then($symbol3, $symbol2), $symbol1)
             =>    IF ($symbol3 then $symbol2) THEN $symbol1

Template 4   IF    $symbol1 'is true if' $symbol2 'is false'
             THEN  F_rule(F_not($symbol2), $symbol1)
             =>    IF (not $symbol2) THEN $symbol1

Template 5   IF    $symbol1 'is the sum of' $symbol2 'and' $symbol3
             THEN  F_rule(F_specific($symbol2, $symbol3), $symbol1)
             =>    IF ($symbol2 sum $symbol3) THEN $symbol1

Template 6   IF    'if a person' $symbol2 'then' $symbol3 ', she/he' $symbol1
             THEN  F_rule(F_then($symbol2, $symbol3), $symbol1)
             =>    IF ($symbol2 then $symbol3) THEN $symbol1
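
The rule-building functions and template matching can be sketched as follows. This is only a minimal illustration, assuming that templates are matched with regular expressions over single-word symbols; the regular expressions, the `extract_rule` helper and the sample sentences are hypothetical and not taken from the paper.

```python
import re


# Rule-building functions in the spirit of F_rule, F_then, F_and and F_not.
def f_rule(pattern, response):
    return f"IF {pattern} THEN {response}"


def f_then(a, b):
    return f"({a} then {b})"


def f_and(a, b):
    return f"({a} and {b})"


def f_not(a):
    return f"(not {a})"


# Each entry pairs a sentence template with the function composition it triggers.
# More specific templates are listed first.
TEMPLATES = [
    (re.compile(r"^(\w+) is a sequence of (\w+) and (\w+)$"),
     lambda s1, s2, s3: f_rule(f_then(s2, s3), s1)),
    (re.compile(r"^(\w+) is true if (\w+) is false$"),
     lambda s1, s2: f_rule(f_not(s2), s1)),
    (re.compile(r"^(\w+) is (\w+) and (\w+)$"),
     lambda s1, s2, s3: f_rule(f_and(s2, s3), s1)),
]


def extract_rule(sentence):
    """Return the production rule for the first matching template, else None."""
    for regex, build in TEMPLATES:
        match = regex.match(sentence.strip())
        if match:
            return build(*match.groups())
    return None


print(extract_rule("comfortable is warm and quiet"))
print(extract_rule("greeting is a sequence of knock and enter"))
print(extract_rule("alarm is true if door is false"))
```

A real agent would map multi-word phrases to symbols before matching; the single-word restriction here only keeps the sketch short.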

3.2  Learning  Parameters  of  Bayesian  Network  Based  on  Semantic  Constraint 

To keep the learning problem simple, we restrict the problem space as follows.
First, every node in the network has at most two states. Even if a node x has n
(n > 2) states, we can split it into n two-state nodes, each of which enables or
disables one state of the original node x. Second, we assume that the causal
dependencies between the variables in the network are already known. Every
semantic relation extracted from conversation is already expressed as an arc
between two variables in the Bayesian network. For example, if we obtain the
semantic rule x -> y, the corresponding structure, an arc from x to y, is
already captured in the network.

The production rules (or semantic constraints) generated in Section 3.1 are used
for directly learning the parameters of Bayesian networks. The learning
mechanism is composed of three steps. The first step is to find the Markov
blanket $M_x$ of a node x in a Bayesian network. The next step is to create a
truth table over the nodes in $M_x \cup \{x\}$ based on the semantic
constraints. The last step is to create the conditional probability table (CPT)
of the node x using the truth table created in the second step.


Fig. 2. A Markov blanket of a node x

In 1996, Koller and Sahami [12] proposed a cross-entropy based technique, known
as the Markov blanket, for identifying redundant and irrelevant features. As
shown in Fig. 2, the Markov blanket of a node x in a Bayesian network is the set
of nodes $M_x$ composed of x's parents, its children, and its children's other
parents. Formally, let B be the set of nodes that compose a Bayesian network and
let $M_x$ be a subset of nodes that does not contain the node x, i.e.,
$M_x \subseteq B$ and $x \notin M_x$. $M_x$ is a Markov blanket of the node x if
x is conditionally independent of any distinct node y ($y \notin M_x$ and
$y \neq x$) given $M_x$, i.e., $P(x \mid M_x, y) = P(x \mid M_x)$.

The Markov blanket of a node contains all the variables that shield the node
from the rest of the network. This means that the Markov blanket of a node
covers the range of knowledge needed to predict the behavior of the node. When
the parameters of node x are learned, therefore, we limit the range to be
updated to $M_x$ instead of all the nodes in B.
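
The definition of the Markov blanket (parents, children, and the children's other parents) translates directly into a few set operations. The sketch below assumes the network structure is given as a dictionary from each node to the set of its parents; that representation, the function name and the example graph are illustrative choices, not part of the paper.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {node: set of parent nodes}."""
    children = {c for c, ps in parents.items() if node in ps}
    blanket = set(parents.get(node, ()))   # parents of the node
    blanket |= children                    # children of the node
    for child in children:
        blanket |= set(parents[child])     # other parents of each child
    blanket.discard(node)                  # the node itself is never in its blanket
    return blanket


# Hypothetical network: a -> x, x -> c, b -> c, c -> d
parents = {
    "a": set(),
    "b": set(),
    "x": {"a"},
    "c": {"x", "b"},
    "d": {"c"},
}
print(sorted(markov_blanket(parents, "x")))  # ['a', 'b', 'c']
```

Node `d` is left out because it is shielded from `x` by `c`, which is exactly why only the nodes in the blanket need to be considered when the parameters of `x` are updated.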

After finding the Markov blanket $M_x$ of the node x, the proposed method
generates a truth table over the values of node x and the nodes in $M_x$. The
truth table is used to determine whether the production rules from Section 3.1
are satisfied. To generate the truth table, we regard each production rule as a
proposition, and mark 'T' in the truth table if the proposition or its
contraposition is satisfied, 'F' if the proposition is not satisfied, and 'X' in
the other cases. For example, the truth table shown in Table 3 can be
constructed for the production rule "if x=1 then y=1" with
$M_x \cup \{x\} = \{x, y\}$.

Table 3. Truth table


x    y    x=1 -> y=1

0    0    T
0    1    X
1    0    F
1    1    T

If (x, y) = (1, 1), the production rule "if x=1 then y=1" is satisfied, so the
table value is 'T'. If (x, y) = (0, 0), the contraposition of the production
rule, "if y=0 then x=0", is satisfied, so the table value is 'T'. If
(x, y) = (1, 0), neither the production rule nor its contraposition is
satisfied, so the table value is 'F'. If (x, y) = (0, 1), neither the production
rule nor its contraposition can be justified by the condition, so the table
value is 'X'.

A conversational agent can contain multiple production rules. When many
production rules are given, the final truth table is generated as follows. Let
$C_i$ be the i-th combination of values in the truth table over
$M_x \cup \{x\}$. The truth table value of $C_i$ is 'T' if and only if one or
more production rules are satisfied under the condition $C_i$ and no production
rule is violated under $C_i$. It is 'F' if and only if one or more production
rules are violated under the condition $C_i$. Otherwise it is 'X', which means
that no production rule can be justified under the condition $C_i$.
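
The marking and combination scheme can be sketched in a few lines. The rule encoding used here (a dictionary of antecedent values plus a single consequent) and the function names are assumptions made for this example; the paper does not prescribe a data structure.

```python
from itertools import product


def evaluate(rule, assignment):
    """Mark one rule under one assignment: 'T', 'F' or 'X'."""
    antecedent, (c_var, c_val) = rule
    ante = all(assignment[v] == val for v, val in antecedent.items())
    cons = assignment[c_var] == c_val
    if ante:
        return "T" if cons else "F"   # rule satisfied, or rule violated
    return "X" if cons else "T"       # not justified, or contraposition satisfied


def truth_table(variables, rules):
    """Combine the marks of all rules for every 0/1 combination of the variables."""
    table = {}
    for values in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        marks = [evaluate(rule, assignment) for rule in rules]
        if "F" in marks:
            table[values] = "F"       # at least one rule is violated
        elif "T" in marks:
            table[values] = "T"       # some rule satisfied, none violated
        else:
            table[values] = "X"       # no rule can be justified
    return table


rule = ({"x": 1}, ("y", 1))           # "if x=1 then y=1"
print(truth_table(["x", "y"], [rule]))
# {(0, 0): 'T', (0, 1): 'X', (1, 0): 'F', (1, 1): 'T'}
```

With the single rule "if x=1 then y=1" the output reproduces the marks of Table 3.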

Table 4. Probability distribution table


x    y    Distribution

0    0    0.9
0    1    0.5
1    0    0.1
1    1    0.9

In order to assign specific values to the CPT of the node x, a data distribution
(or density) table is needed rather than the truth table. Hence, we convert the
truth table into a data distribution table by assigning specific values
according to the truth table values: $\alpha$ is assigned to 'T',
$(1 - \alpha)$ to 'F', and 0.5 to 'X'. Using this data distribution table, the
values of the CPT are finally determined. For instance, Table 4 shows the data
distribution table generated from Table 3 with $\alpha$ set to 0.9. The
probability $P(x=1 \mid y=1)$ can be calculated as follows:


$$P(x=1 \mid y=1) = \frac{D(x=1, y=1)}{D(x=1, y=1) + D(x=0, y=1)} = \frac{0.9}{0.9 + 0.5} \approx 0.64286$$

where D(x=a, y=b) denotes the value of the data distribution table.
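
The conversion from truth table to distribution table and the normalisation that yields a CPT entry can be checked with a short script. The function names and the dictionary layout are assumptions made for this sketch; the numbers are those of Tables 3 and 4.

```python
def to_distribution(table, alpha=0.9):
    """Map truth-table marks to weights: T -> alpha, F -> 1 - alpha, X -> 0.5."""
    weight = {"T": alpha, "F": 1 - alpha, "X": 0.5}
    return {combo: weight[mark] for combo, mark in table.items()}


def conditional(dist, variables, target, target_value, given):
    """P(target = target_value | given), normalising over the values of target."""
    index = {v: i for i, v in enumerate(variables)}

    def matches(combo, fixed):
        return all(combo[index[v]] == val for v, val in fixed.items())

    numerator = sum(d for c, d in dist.items()
                    if matches(c, {**given, target: target_value}))
    denominator = sum(d for c, d in dist.items() if matches(c, given))
    return numerator / denominator


table3 = {(0, 0): "T", (0, 1): "X", (1, 0): "F", (1, 1): "T"}
dist = to_distribution(table3, alpha=0.9)
p = conditional(dist, ["x", "y"], "x", 1, {"y": 1})
print(round(p, 5))  # 0.64286
```

The same `conditional` call with `{"y": 0}` gives the other column of the CPT of x.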




S.-H.  Lee,  S.  Lim,  and  S.-B.  Cho 


4  Experimental  Results 

We evaluate the proposed conversation-based parameter learning algorithm
(algorithm 1) in a smart home environment. In order to validate our approach,
the results are compared with two other algorithms: a conventional data-based
learning method (algorithm 2) and a conversation-based approach with a fixed
learning weight for user input (algorithm 3).


Fig.  3.  Bayesian  network  of  smart  home  agent  designed  for  experiment 

We design a BN module for smart home management as described in Fig. 3. It
reflects the relationships between home appliances, home status, and the outdoor
environment, and is aimed specifically at controlling the air conditioner and
the window. The experiments were carried out using a dataset of possible
situations stochastically generated according to our artificial home
environment. The dataset has 2,000 situations of home status, half of which are
used to learn the parameters of the BN, while the remaining data are used for
testing its accuracy. The accuracy of the learned model is evaluated by
comparing the dominant status according to the posterior belief with the
solution contained in the dataset.


Table  5.  Conversational  input  from  user  in  a  form  of  natural  language 


Constraint  Dialogue

C1  "It's hot inside, and cool outside. Open up the window."
C2  "It's really hot today, and I can't stand hot like this. Turn on the air
    conditioner."
C3  "It's really cold today, I'm feeling cold inside. Why don't we close the
    window?"
C4  "It's freezing and I'm feeling cold. Close the window."
C5  "Air conditioner is on. Close the all windows."
C6  "It's raining hard. Close the window."

User input through conversation is applied for learning as described in Table 5.
It shows simple dialogues, each corresponding to a possible situation and
containing a heuristic rule. These rules are learned by the conversational agent
mainly using the form of template 1 in Table 2. In this experiment, we assume
that all semantic rules are applied to the BN model at a specific time rather
than incrementally along the time line. For instance, the parameters calculated
from all conversational input are combined with the parameter sets of the BN
after learning from 50, 100, 200, 300, and 400 data sets, respectively, in
algorithms 1 and 3.


Fig. 4. Variations in accuracy for different levels of observations


Table 6. Overall accuracy of three different learning methods


Learning method                                    Window   Air conditioner   Overall

Data based learning                                69.7%    78.7%             66.9%
Conversation & Data with fixed learning weight     87.4%    97.0%             84.7%
Conversation & Data with varying learning weight   88.9%    97.4%             86.5%

The results for the three learning algorithms are plotted in Fig. 4, where (a),
(b) and (c) show the results for the inferred status of the air conditioner, the
window, and both, respectively. We can see that all three algorithms
successfully adapt to the environment after enough data sets have been observed.
However, there are large gaps between algorithm 2 and algorithms 1 and 3 at the
early stage. Due to the inherent nature of data-driven learning of BNs, it is
almost impossible to reflect the environment exactly when the data sets are
small. The proposed conversational approach overcomes this limitation: it
already shows good performance before 100 data sets have been observed, because
of its instant and direct learning from conversation. This confirms that the
proposed method enables the system to capture the features of a given
environment quickly and with low effort, through interaction with humans and
without domain experts.

Moreover, we can see that algorithm 1 helps the system keep slightly higher
accuracy than the others even when learning is mature. This means that additive
learning from conversation can possibly lead to a more reliable agent system,
not only in the initial phase but also in the stable phase. This is also
supported by Table 6, which shows the overall accuracy during the learning phase
for each algorithm. Here, algorithm 3 has lower accuracy than algorithm 1. We
can see that it is important to control the learning weight in
conversation-based learning.


5  Conclusions  and  Future  Works 

In this paper, we proposed a direct parameter learning method for Bayesian
networks from conversation with users. We defined functions and templates in
order to extract semantic information from natural language, and designed a
language to facilitate the representation of relations between pieces of
information in a symbolic description. We developed a novel learning method that
includes a mechanism for converting semantic rules into a probability density
for updating the CPT. By applying this method, a system can easily be adapted to
a new environment or a new user without collecting much data, while keeping a
high level of reliability. In addition, it gives a user who does not have any
prior knowledge of BNs a chance to develop his or her own probability network.
As future work, we will extend the proposed method to structure learning of BNs
and apply it to the case in which constraints are distributed along a time line.


Keystroke Dynamics Extraction by Independent
Component Analysis and Bio-matrix for User
Authentication


Abstract. Keystroke dynamics is a unique, specific characteristic used for the
user authentication problem. There is much research on detecting personal
keystroke dynamics and authenticating users based on these characteristics. Most
studies consider either key press durations and multiple key latencies (typing
time) or key-pressed forces (pressure-based typing) to find the personal motif
(unique specific characteristic). This paper extracts keystroke dynamics by
using independent component analysis (ICA) through a standardized bio-matrix
built from typing sound signals, which contain both typing time and typing force
information. The ICA representation of keystroke dynamics is effective for
authenticating users in our experiments. The experimental results show that the
proposed keystroke dynamics extraction solution is feasible and reliable for
solving the user authentication problem, with a false acceptance rate (FAR) of
4.12% and a false rejection rate (FRR) of 5.55%.

Keywords: Behavioural biometrics, keystroke dynamics, pattern recognition,
independent component analysis (ICA), user authentication.


1  Introduction 

Keystroke dynamics is detected from user characteristics based on how he types
on the keyboard, not on what he types. Keystroke dynamics captures typing
characteristics such as the key press duration ('dwell time') and digraph or
trigraph times, i.e., the latency between striking successive keys. All
attributes of the user extracted from typing are linked to the user's profile
through a machine learning system. They are used to verify the user by detecting
his typing characteristics the next time. In the previous reports on keystroke
dynamics, the characteristics were analyzed in a long-text setting (see [1],
[2]). In most recent publications, keystroke dynamics is retrieved from short
text input such as a user ID and password. Various algorithms and methods have
been studied for authenticating keystroke dynamics, such as fuzzy algorithms
[3], neural network - support vector machine [4] and multiple sequence alignment
[5]. Besides the methods based on typing time, pressure-based typing methods
have been proposed ([6], [7]). These publications show that keystroke dynamics
can be used to improve security, like physiological methodologies.

All the above publications detect keystroke dynamics based on either typing time
or typing force. Our approach uses both characteristics to solve the user
authentication problem. In this paper, we propose an indirect method to detect
key-pressed time, key-released time and key-typed forces by analyzing the sound
signals created when typing on a keyboard. Fig. 1 summarizes the process of our
proposed user authentication method. Keystroke dynamics characteristics are
retrieved from sound signals by using a sound recorder. In the pre-processing
phase, typing sound signals containing both characteristics are standardized and
translated to a keystroke dynamics bio-matrix. The keystroke dynamics bio-matrix
represents the unique characteristics of the user's typing habit. To extract
keystroke dynamics features, the independent component analysis (ICA) method is
applied. ICA is a recently developed method in which the goal is to find a
linear representation of non-Gaussian data so that the components are
statistically independent, or as independent as possible. Such a representation
seems to capture the essential structure of the data in many applications,
including feature extraction and signal separation [8]. Face recognition [9] and
facial feature extraction [10] are examples using ICA. In this paper, we use the
ICA second architecture described in [9] to extract keystroke dynamics features
from the bio-matrix. Experimental results show that our approach, using the
keystroke dynamics bio-matrix, the ICA extraction method and a neural network
(Fast Artificial Neural Network Library, FANN [16]) for recognition, is feasible
and reliable for solving the user authentication problem, with 4.12% FRR and
5.55% FAR.

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 477-486, 2010.
© Springer-Verlag Berlin Heidelberg 2010

The remainder of this paper is organized as follows. In Section 2, we present an
overview of the solution, the pre-processing phase and the keystroke dynamics
bio-matrix. Section 3 describes architecture II, which applies the ICA method to
extract keystroke dynamics from the bio-matrix. Experimental results of the
solution combining the bio-matrix, the ICA extraction method and FANN are
reported in Section 4. Conclusions and future work are given in Section 5.


2  Keystroke  Dynamics  Represented  by  Bio-matrix 

2.1  Indirect  Method  to  Measure  Keystroke  Dynamics 

The process illustrated in Fig. 1 has two phases: registration and
authentication. In the registration phase, the user is required to input his
username and password $N_R$ times (15 times in our experiments). Of the $N_R$
registrations, $N_{RS}$ are made in a silent environment without noise to
determine the initial parameter values. After registering, the user is
authenticated when accessing the system again. The sound signals received when
the user types on the keyboard are analyzed. The spectral sound signals of the
typing pattern are standardized and translated to the keystroke dynamics
bio-matrix in the pre-processing phase. The ICA second architecture is then
applied to extract keystroke dynamics features from the bio-matrix. The feature
vector (ICA representation) is used as the input of FANN for training
(registration) and testing (authentication).


Fig. 1. Registration (a) and authentication (b) using keystroke dynamics


2.2  Pre-processing 


The original sound signal is pre-processed to build the corresponding keystroke
dynamics bio-matrix. Fig. 2(a) is an example of the sound signals of pressing
and releasing keys. It also shows the difference in typing forces. The sound
signal is transformed to the frequency domain by the short-time Fourier
transform. The Gabor transform is used to analyze the typing sound signal
because this transformation has no cross-terms and avoids confusion between
noise and non-noise components. Moreover, this transformation has lower
computational complexity, which improves speed. Fig. 2(b) displays the
spectrogram of the 'onetntall' typing pattern.

In the registration phase, the threshold values for each user (the
high-frequency threshold $\theta F_{high}$ and the low-frequency threshold
$\theta F_{low}$) are calculated from the first $N_{RS}$ registrations made in
the silent environment. Here, $N_{RS}$ is the number of registrations in the
silent environment, $f^i_j$ is the frequency value of the i-th registration, and
j is the index of the signal frequency for each registration.

The spectrogram of the original sound signal is used to create the keystroke
dynamics bio-matrix described in the next section.


2.3  The Keystroke Dynamics Bio-matrix


The original typing signal is filtered by a band-pass filter with thresholds
$\theta F_{high}$ and $\theta F_{low}$ in order to obtain the exact typing
frequency domain. An intensity matrix $MI_{N_T \times N_F}$ is built from that
domain, in which each element is calculated by formula (5).


$$\delta T = \frac{T}{N_T} \tag{3}$$

$$\delta F = \frac{\theta F_{high} - \theta F_{low}}{N_F} \tag{4}$$

where T is the time during which the user inputs the password, $N_T$ is the
predefined number of sections of the time T, and $\delta T$ is the length of
each time section; $N_F$ is the predefined number of sections of the interval
$[\theta F_{low}, \theta F_{high}]$, and $\delta F$ is the length of each
frequency section.

$$MI_{x,y} = \sum_{i=(x-1)\delta T}^{x\,\delta T} \; \sum_{j=(y-1)\delta F + \theta F_{low}}^{y\,\delta F + \theta F_{low}} I_{ij} \tag{5}$$


where $x = 1..N_T$, $y = 1..N_F$, and $I_{ij}$ is the intensity of frequency $f_j$.

From the intensity matrix, the maximum and minimum intensities over all elements
are calculated by formulas (6) and (7):

$$I_{max} = \max(MI_{x,y}) \tag{6}$$

$$I_{min} = \min(MI_{x,y}) \tag{7}$$

where $x = 1..N_T$, $y = 1..N_F$.
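
A discrete version of formulas (3) to (7) can be sketched as follows. The sketch assumes that the spectrogram is available as a list of time frames, each holding one intensity per frequency bin, and that frames are evenly spaced over the typing time; the function name, the argument layout and the toy data are illustrative assumptions rather than the authors' implementation.

```python
def intensity_matrix(spec, freqs, theta_low, theta_high, n_t, n_f):
    """Sum spectrogram intensities into an n_t x n_f grid.

    spec[i][j] is the intensity at time frame i and frequency freqs[j].
    """
    delta_f = (theta_high - theta_low) / n_f        # width of a frequency section
    n_frames = len(spec)
    mi = [[0.0] * n_f for _ in range(n_t)]
    for i, frame in enumerate(spec):
        x = min(i * n_t // n_frames, n_t - 1)       # time section of frame i
        for j, f in enumerate(freqs):
            if not (theta_low <= f <= theta_high):
                continue                            # band-pass: drop out-of-band bins
            y = min(int((f - theta_low) / delta_f), n_f - 1)
            mi[x][y] += frame[j]
    return mi


# Toy spectrogram: 4 frames, 5 frequency bins; the 50 Hz bin is out of band.
freqs = [50, 100, 200, 300, 400]
spec = [
    [9, 1, 2, 3, 4],
    [9, 1, 2, 3, 4],
    [9, 0, 1, 0, 1],
    [9, 2, 0, 2, 0],
]
mi = intensity_matrix(spec, freqs, theta_low=100, theta_high=400, n_t=2, n_f=2)
print(mi)                                           # [[6.0, 14.0], [3.0, 3.0]]
print(max(map(max, mi)), min(map(min, mi)))         # 14.0 3.0
```

The last line corresponds to the maximum and minimum intensities taken over all elements of the matrix.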

We propose the keystroke dynamics bio-matrix $bioM_{N_T \times N_F}$, whose
elements (called bio-cells) represent the correlative intensity of the elements
of the intensity matrix $MI_{N_T \times N_F}$. Here, $N_I$ is the predefined
number of intensity sections of the intensity matrix $MI_{N_T \times N_F}$.

Fig. 2. (a) Time-sequence signal of password 'onetntall'. (b) Spectrogram of
password 'onetntall'.


In  the  next  section,  we  describe  the  ICA  second  architecture  to  extract  feature 
vector  from  the  keystroke  dynamics  bio-matrix. 


3  Extracting  Keystroke  Dynamics  Feature  by  ICA 

Independent Component Analysis (ICA) minimizes both second-order and
higher-order dependencies in the input data and attempts to find the basis along
which the data (when projected onto it) are statistically independent. Bartlett
et al. [9] provided two architectures of ICA for the face recognition task:
Architecture I, statistically independent basis images, and Architecture II,
statistically independent coefficients.

In this keystroke dynamics recognition problem, our goal is to find coefficients
of feature vectors that are as independent as possible. Therefore, in this
paper, we selected architecture II of the ICA method for the keystroke dynamics
representation. A number of algorithms for performing ICA have been proposed
(see [8] for reviews). In this paper, we apply the FastICA algorithm developed
by Aapo Hyvärinen [8] in our experiments.

Architecture II: Statistically Independent Coefficients. The goal in this
approach is to find a set of statistically independent coefficients. A similar
approach was used for face recognition [9] and for facial feature extraction
[10].

We organize a data matrix X so that the keystroke dynamics bio-matrices are in
its columns and the bio-cells are in its rows. Bio-cells i and j are independent
if, when moving across the entire set of bio-matrices, it is not possible to
predict the value taken by bio-cell i based on the corresponding value taken by
bio-cell j on the same bio-matrix. The goal in architecture I is to use ICA to
find a set of statistically independent basic bio-matrices. Although the basic
bio-matrices found in architecture I are approximately independent, the feature
vectors of each bio-matrix, obtained by projecting onto the subspace of
statistically independent basic bio-matrices, are not necessarily independent.
Architecture II uses ICA to find a representation in which the coefficients used
to represent a bio-matrix in the subspace of basic bio-matrices are
statistically independent. Each row of the weight matrix W is a bio-matrix. A,
the inverse matrix of W, contains the basic bio-matrices in its columns. The
statistically independent coefficients in S are recovered in the columns of U
(see Fig. 3); each column of U contains the coefficients for combining the basic
bio-matrices in A to construct the bio-matrices of X.

Architecture II is implemented through the following steps.

Assume that we have n bio-matrices, each of which has p bio-cells. The data
matrix X therefore has order p x n.


Fig. 3. Finding coefficients for which the representations of the bio-matrices
are independent



1.  Let R be a p x m matrix containing the first m eigenvectors of a set of n
keystroke dynamics bio-matrices in its columns.

2.  Calculate the set of principal components of the set of bio-matrices in X:

$$C = R^T \times X \tag{9}$$

3.  Determine the coefficients for linearly combining the basic bio-matrices
in A:

$$U = W \times C \tag{10}$$

Assume that we have a set of bio-matrices for testing, $X_{test}$. Feature
extraction for $X_{test}$ is computed through the following steps. First, from
$X_{test}$, we calculate the set of principal components of $X_{test}$ by:

$$C_{test} = R^T \times X_{test} \tag{11}$$

Then, the set of feature vectors of $X_{test}$ in the basic bio-matrices space
is calculated by:

$$U_{test} = W \times C_{test} \tag{12}$$

Each column of $U_{test}$ is the feature vector corresponding to one bio-matrix
of $X_{test}$.
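
The two projections in formulas (9) to (12) are plain matrix products, which the following sketch makes explicit together with the matrix shapes. It assumes that R (from PCA) and W (from an ICA algorithm such as FastICA) have already been estimated; the helper names and the small numeric matrices are hypothetical and serve only to show the data flow.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in b_cols] for row in a]


def extract_features(R, W, X):
    """Return U = W (R^T X); each column of U is one feature vector.

    R: p x m eigenvector matrix, W: m x m ICA weight matrix, X: p x n data matrix.
    """
    C = matmul(transpose(R), X)   # m x n principal-component coefficients
    return matmul(W, C)           # m x n independent coefficients


# Hypothetical values with p = 3 bio-cells, m = 2 components, n = 2 bio-matrices.
R = [[1, 0], [0, 1], [0, 0]]
W = [[1, 1], [1, -1]]
X_test = [[1, 2], [3, 4], [5, 6]]
U_test = extract_features(R, W, X_test)
print(U_test)  # [[4, 6], [-2, -2]]
```

The same function serves both registration (applied to X) and authentication (applied to a test set), since only the data matrix changes.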

To represent keystroke dynamics with the ICA method, we first apply PCA to
project the data into an m-dimensional subspace in order to control the number
of independent components produced by ICA; ICA is then applied to the
eigenvectors to minimize the statistical dependence of the feature vectors in
the basic bio-matrices space. Thus, PCA decorrelates the input data, and the
remaining higher-order dependencies are separated by ICA.

4  Experimental  Results 

In our experiments, N_U users are invited to test the proposed authentication system in two experiments. Experiment 1 authenticates in a silent environment without any noise. Experiment 2 authenticates in both a silent environment and a workable environment (e.g. library, school yard, coffee shop, ...).

Table 1. Parameters of experiments 1 and 2

Experiment  N_U  Nn  Nrns  N_Auth  N_Attack  N_T  N_F  N_I
1           20   15  15    10      10        100  100  256
2           20   15  5     10      10        100  100  256

In each experiment, after registering, each user accesses the system N_Auth times to test the authentication ability of the system. In addition, every user's username and password are made public, and five other persons use that information to attack the system. An intruder attacks one account N_Attack times. We set the number of time sections N_T to 100, the number of frequency sections N_F to 100, and the number of intensity sections N_I to 256. Table 1 shows the parameters of experiment 1 and experiment 2.

The recognizer was implemented with a neural network. The Fast Artificial Neural Network Library (FANN [16]) is used in our experiments. FANN is a free open source neural network library that implements multilayer artificial neural networks in C and supports both fully connected and sparsely connected networks. FANN has been used in many studies. Our use of FANN includes:

Training (registration) step: assume that we have N_U classes (N_U different users); training with FANN creates N_U sets of weights. Each set of weights corresponds to one class (one user).

Testing (authentication) step: the input is the ICA feature vector of a user's keystroke dynamics bio-matrix (one of the N_U users mentioned above). This feature vector is tested with the N_U sets of weights created in the training step, and the user is assigned to the class whose set of weights gives the largest output.
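The decision rule of the testing step can be sketched as an argmax over the per-user outputs. This sketch does not call FANN; `outputs` is a hypothetical list holding the output obtained with each of the N_U sets of weights for one feature vector.

```python
def classify(outputs):
    """Return the index of the user whose set of weights gives the largest output."""
    return max(range(len(outputs)), key=lambda u: outputs[u])

# Hypothetical outputs for N_U = 3 users.
predicted_user = classify([0.1, 0.7, 0.2])
```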

One of our experimental results shows that the proposed authentication system is acceptable, with 4.2% FAR and 5.6% FRR in the silent environment and 3.9% FAR and 6.1% FRR in the workable environment. Table 2 shows the results of experiments 1 and 2. The results in the silent environment and the workable environment do not deviate much, which shows that the keystroke dynamics features extracted by ICA are not very different between the silent environment and the workable environment.
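As a small worked check, the FAR values quoted above follow from the attack counts reported in Table 2, assuming the usual definition of FAR as the percentage of intruder attempts that are accepted. The function name is hypothetical.

```python
def rate_percent(events, trials):
    """Percentage of trials in which the event occurred."""
    return 100.0 * events / trials

# Counts from Table 2: successful attacks out of 1000 attacks.
far_silent = rate_percent(42, 1000)     # experiment 1
far_workable = rate_percent(39, 1000)   # experiment 2
```

The FRR would be computed in the same way from rejected genuine attempts, but the source does not list those counts, so they are not reproduced here.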


Table 2. Total FAR and FRR for experiments 1 and 2

Experiment  Number of authentic  Number of intruder  Number of  Number of successful  FAR%  FRR%
            participants         participants        attacks    attacks
1           20                   20                  1000       42                    4.2   5.6
2           20                   20                  1000       39                    3.9   6.1

Table 3. Experimental results

         Silent environment  Workable environment
Group    FAR%    FRR%        FAR%    FRR%
1        4.20    5.60        3.90    6.10
2        4.10    5.50        3.80    6.20
3        3.90    5.70        3.70    6.20
4        4.00    5.20        3.70    6.30
5        4.50    5.00        4.00    6.00
6        4.10    5.70        3.70    6.10
7        4.00    5.70        4.00    5.90
8        4.20    5.50        4.00    6.00
9        3.90    5.90        3.60    6.40
10       4.20    5.80        3.80    6.10
11       4.10    5.40        3.90    6.00
12       4.30    5.70        3.80    5.80
13       4.00    5.50        3.90    6.40
14       4.20    5.60        4.00    6.20
15       4.10    5.50        3.70    6.10
Average  4.12    5.55        3.83    6.12

Table 4. Comparison of our results with previous efforts

Research                                      Number of      Training          Password  FAR%   FRR%
                                              participants   samples           string
Leggett and Williams (1988) [11]              36             12                large     5.00   5.50
Joyce and Gupta [12]                          33             8                 4         13.30  0.17
De Ru and Eloff [13]                          29             Varies (2 to 10)  1         2.80   7.40
Haider et al. [14]                            Not mentioned  15                1         6.00   2.00
Araujo et al. [15]                            30             10                1         1.89   1.45
Eltahir et al. [6]                            23             20                1         3.75   3.04
Kenneth Revett [5] (threshold 0.60)           20             10                1         0.80   0.90
Our proposed method (silent environment)      20             15                1         4.12   5.55
Our proposed method (workable environment)    20             15                1         3.83   6.12

Other groups were invited to test the system in the same way as in the two experiments above. The results of testing in the silent and workable environments are summarized in Table 3. They show that the proposed system is feasible and reliable.

Table 4 shows a comparison between the results obtained here and previous research efforts. Note that these systems use different sample sizes with different parameters and methodologies to measure the results. Nevertheless, our proposed method gives results comparable with those of existing methods. This shows the feasibility and reliability of using sound signals to measure keystroke dynamics for authentication.


5  Conclusion 

This study proposed an indirect method to measure the pressure of key typing via sound signals, which makes widespread deployment easier because it does not use any specific device such as a bio-keyboard. In addition, the novel keystroke dynamics bio-matrix combines both typing time and typing force. It is converted to an ICA feature vector to authenticate the user reliably with FANN. The experimental results show that the proposed authentication system is feasible and reliable. Moreover, they show that keystroke dynamics extraction using the second ICA architecture is effective and stable in different environments. In the future, we will continue experimenting with many groups of users in order to apply this authentication solution to practical problems with many users.


Abstract. In this paper, we present a modified version of Incremental Kernel Principal Component Analysis (IKPCA), which was originally proposed by Takeuchi et al. as an online nonlinear feature extraction method. The proposed IKPCA learns a high-dimensional feature space incrementally by solving an eigenvalue problem whose matrix size is given by the power of the number of independent data. In the proposed IKPCA, independent data are used for calculating eigenvectors in a feature space, but they are selected in a low-dimensional eigen-feature space. Hence, the size of the eigenvalue problem is usually small, and this allows IKPCA to learn eigen-feature spaces very fast even though the eigenvalue decomposition has to be carried out at every learning stage. The proposed IKPCA consists of two learning phases: an initial learning phase and an incremental learning phase. In the former, some parameters are optimized and an initial eigen-feature space is computed by applying the conventional KPCA. In the latter, the eigen-feature space is incrementally updated whenever a new datum is given. In the experiments, we evaluate the learning time and the approximation accuracies of eigenvectors and eigenvalues. The experimental results demonstrate that the proposed IKPCA learns eigen-feature spaces very fast with good approximation accuracy.


1  Introduction 

Eigenspace analysis such as Principal Component Analysis (PCA) has played an important role in classification tasks such as face recognition and object recognition. These methods are used for finding a small number of useful features of target objects, and this feature extraction often enhances the generalization performance of a system as well as the efficiency in memory and computation costs. Recently, Kernel Principal Component Analysis (KPCA) [1] has been extensively studied as an extension of PCA. In KPCA, eigen-axes are obtained in a high-dimensional inner product space called a feature space. Since KPCA generally gives a set of nonlinear axes in an input space, a complex data distribution can be represented with a small number of such axes; hence, it is expected to improve the generalization performance of a classifier more efficiently. However, KPCA is usually applied to a static data set; therefore, it is not suited to learning a dynamic data set, in which only a small subset of data is given at a time and such subsets are provided sequentially over time. Although the conventional KPCA can be used for incremental learning if all data are stored in memory, this would be unrealistic for large-scale high-dimensional data such as face images [2]. In this case, an incremental learning algorithm for KPCA, which can update an eigenspace model without keeping all the past training data, is needed under realistic environments.

Many incremental algorithms for eigenspace learning have been proposed so far. Most of them are Incremental PCA (IPCA) [3,4,5,6] or Incremental Linear Discriminant Analysis (ILDA) [7]. To the best of our knowledge, only a few incremental learning algorithms for KPCA have been proposed [5,8,9]. This is because the eigenvectors in a feature space cannot be updated in a direct way. To solve this problem, Takeuchi et al. proposed an Incremental KPCA (IKPCA) algorithm [9], which was extended from the Incremental PCA (IPCA) algorithm proposed by Hall et al. [3]. In Takeuchi et al.'s IKPCA, eigenvectors are represented by linearly independent training data which are selected in a low-dimensional eigen-feature space. Therefore, the number of training data to be kept in memory is very small compared with the conventional IKPCA algorithms [5,8]. This allows Takeuchi et al.'s IKPCA to learn an eigen-feature space very fast even though the eigenvalue decomposition has to be carried out at every learning stage.

In this paper, we fix the mistakes in the derivation of the update equations for the accumulation ratio in Takeuchi et al.'s IKPCA, which made the eigenspace learning a little unstable. We further extend the IKPCA algorithm such that parameters are automatically optimized for initial training data based on a cross-validation method. The proposed IKPCA consists of two learning phases: an initial learning phase and an incremental learning phase. In the former phase, some parameters are optimized and an initial eigen-feature space is computed by applying the conventional KPCA. In the latter phase, the eigen-feature space is incrementally updated whenever a new training datum is given.

The  rest  of  this  paper  is  organized  as  follows.  Section  2  gives  a  brief  review  on 
KPCA.  In  Section  3,  we  present  a  modified  algorithm  for  the  Takeuchi  et  al.'s 
IKPCA  which  had  some  mistakes  in  the  algorithm  derivation.  Section  4  shows 
the  experimental  results  to  verify  the  effectiveness  of  IKPCA  under  incremental 
learning  environments.  Finally,  we  give  conclusions  in  Section  5. 

2  Kernel  Principal  Component  Analysis 

In KPCA, an n-dimensional input x is mapped to an l-dimensional vector $\phi(x)$, where $\phi(\cdot)$ is the function that maps an input into the l-dimensional feature space. To obtain eigenvectors in the feature space, we first define the following covariance matrix:

$Q = \frac{1}{N} \sum_{i=1}^{N} (\phi(x_i) - c)(\phi(x_i) - c)^{T}$ (1)

where N is the number of input data and $c = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)$. The eigenvectors are obtained by solving the following eigenvalue problem:

$Q Z = Z \Lambda$ (2)

where Z and $\Lambda$ are an eigenvector matrix and an eigenvalue matrix, respectively. In practice, however, this problem can hardly be solved in a direct way because the dimensionality of a feature space is generally very high and could be infinite. To avoid explicit calculation in the feature space, the so-called kernel trick is applied.

Without loss of generality, we assume that a set of m linearly independent data $\Phi_m = [\phi(x_1), \cdots, \phi(x_m)]$ ($m < N$) spans a feature space where the N training data $\{\phi(x_1), \cdots, \phi(x_N)\}$ are distributed. Then the ith eigenvector $z_i$ is represented by

$z_i = [\phi(x_1), \cdots, \phi(x_m)] \begin{bmatrix} \alpha_{i1} \\ \vdots \\ \alpha_{im} \end{bmatrix} = \Phi_m \alpha_i$ (3)

where $\alpha_i = [\alpha_{i1}, \cdots, \alpha_{im}]^{T}$ ($i = 1, \cdots, m$) is a coefficient vector. Let us define the coefficient matrix $A_m = [\alpha_1, \cdots, \alpha_m]$ and the following kernel matrices:

$H_{Nm} = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_m) \end{bmatrix}$ (4)

$H_{mm} = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix}$ (5)

where $k(\cdot)$ is a kernel function and $k(x_i, x_j) = \phi(x_i)^{T} \phi(x_j)$. Substituting Eq. (1) and Eqs. (3)-(5) into Eq. (2), we can derive the following kernel eigenvalue problem [1]:

$\frac{1}{N} L^{-1} H_{Nm}^{T} (I_N - 1_N) H_{Nm} (L^{-1})^{T} (L^{T} A_m) = (L^{T} A_m) \Lambda_m$ (6)

where $L^{T} \alpha_i$ (i.e., the ith column vector of $L^{T} A_m$) is the ith eigenvector spanning a feature space and $\lambda_i$ is the corresponding eigenvalue; $I_N$ is the $N \times N$ unit matrix and $1_N$ is the $N \times N$ matrix whose elements are all $1/N$. Here, L is obtained by the Cholesky factorization of $H_{mm}$ (i.e., $H_{mm} = L L^{T}$).

Next, assume that we select the first d principal components from the feature space. As a criterion for selecting these components, the following accumulation ratio is often adopted:

$C(d) = \frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{m} \lambda_i}$ (7)

The accumulation ratio C(d) shows how much information remains in the eigen-feature space after the d components are selected. The dimensionality d is selected such that the accumulation ratio for the d-dimensional eigen-feature space is larger than a certain threshold $\theta$.
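The selection of d from the accumulation ratio can be sketched as follows. This is an illustration under two assumptions: the eigenvalues are given in descending order, and the comparison is "greater than or equal to" the threshold, as stated later for the initial learning phase. The function names are hypothetical.

```python
def accumulation_ratio(eigenvalues, d):
    """C(d) of Eq. (7): share of the eigenvalue sum held by the first d components."""
    return sum(eigenvalues[:d]) / sum(eigenvalues)

def select_dimension(eigenvalues, theta):
    """Smallest d whose accumulation ratio reaches the threshold theta."""
    total = sum(eigenvalues)
    partial = 0.0
    for d, lam in enumerate(eigenvalues, start=1):
        partial += lam
        if partial / total >= theta:
            return d
    return len(eigenvalues)

# Hypothetical eigenvalues, sorted in descending order.
eigs = [5.0, 3.0, 1.0, 1.0]
d_low = select_dimension(eigs, 0.75)    # C(2) = 0.8 is the first ratio >= 0.75
d_high = select_dimension(eigs, 0.85)   # C(3) = 0.9 is the first ratio >= 0.85
```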
3  Incremental Kernel Principal Component Analysis (IKPCA)

The proposed Incremental Kernel PCA (IKPCA) is also derived from the eigenvalue problem in Eq. (2), which involves the covariance matrix Q in Eq. (1). However, as in the derivation of KPCA, the matrix decomposition of Q is not performed in a direct way because the size of the covariance matrix is $l \times l$, where $l$ is very large in general. From Eq. (6), the matrix size for KPCA is actually reduced to $m \times m$, where m is the number of independent data in a feature space.

Although the eigenvalue decomposition of the left-hand side matrix in Eq. (6) is feasible to carry out, the computation costs could increase under incremental learning settings [9]. To make IKPCA more efficient, Takeuchi et al. [9] proposed an improved IKPCA. In this IKPCA, the matrix size is further reduced to $d \times d$, where d is the number of dimensions of an eigen-feature space, which is usually much smaller than $l$, especially when the RBF kernel is used. In the derivation of Takeuchi et al.'s IKPCA, we should note that eigenvectors in a feature space are not explicitly calculated; thus, every computation in a feature space should be transformed into a feasible form based on the so-called kernel trick.

The  learning  is  divided  into  two  phases:  initial  learning  phase  and  incremental 
learning  phase.  In  the  former  phase,  some  parameters  are  optimized  and  an  initial 
eigen-feature  space  is  computed  by  performing  the  conventional  KPCA.  In  the 
latter  phase,  the  eigen-feature  space  is  incrementally  updated  whenever  a  new 
training  data  is  given.  In  the  following,  let  us  explain  the  two  learning  phases  in 
more  detail. 

3.1  Initial  Learning  Phase 

Assume that N training data $X_0 = \{x_i\}_{i=1}^{N}$ are given with their class information at the initial learning stage. Let us adopt the following RBF kernel function here:

$k(x, x') = \exp(-\gamma \|x - x'\|^2)$ (8)

where $\gamma$ is the kernel parameter. In this case, there are two parameters to be optimized: the threshold $\theta$ for the accumulation ratio in Eq. (7) and the kernel parameter $\gamma$ in Eq. (8).
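For concreteness, the RBF kernel of Eq. (8) can be written as a short function. This is a minimal sketch for inputs given as plain lists of numbers; the function name is hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel of Eq. (8): exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0], 0.5)   # identical inputs give 1.0
far = rbf_kernel([0.0, 0.0], [1.0, 1.0], 0.5)    # squared distance 2, so exp(-1)
```

A larger $\gamma$ makes the kernel value fall off faster with distance, which is why $\gamma$ is tuned together with $\theta$.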

The purpose of the initial learning phase is not only to compute an initial eigen-feature space but also to find appropriate values of $\theta$ and $\gamma$. The former computation is basically carried out by applying KPCA to the initial training data $X_0$, and the latter operation is conducted with a cross-validation method. The pseudo-code of the initial learning phase is shown in Algorithm 1.

Algorithm 1. Learn Initial Eigen-feature Space
Input: Training data $X_0 = \{x_i\}_{i=1}^{N}$.
Output: Eigen-feature space model $\Omega = \{X_d, A_d, \Lambda_d, \Phi_d^{T} c, \|c\|^2, N\}$, threshold $\theta$, kernel parameter $\gamma$.
1: Perform a cross-validation method using $X_0$ to find optimal values of $\theta$ and $\gamma$.
2: Select m data $X_m = \{x_i\}_{i=1}^{m}$ from $X_0$ such that the data in a feature space $\Phi_m = \{\phi(x_i)\}_{i=1}^{m}$ are linearly independent.
3: Solve the eigenvalue problem in Eq. (6) w.r.t. $\Phi_m$ to obtain $A_m$ and $\Lambda_m$.
4: Obtain the minimum d such that the accumulation ratio in Eq. (7) satisfies $C(d) \ge \theta$.
5: Define a coefficient matrix $A_d$ that consists of the first d column vectors of $A_m$.
6: Select d independent data $\Phi_d = \{\phi(x_i)\}_{i=1}^{d}$ such that D in Eq. (9) is full rank.
7: Solve the eigenvalue problem in Eq. (6) w.r.t. $\Phi_d$ to obtain $A_d$ and $\Lambda_d$.
8: Calculate $\|c\|^2$ and $\Phi_d^{T} c$ in Eqs. (10) and (11).

As shown in Algorithm 1, the first procedure is to find optimal values of $\theta$ and $\gamma$ from a candidate set using a cross-validation method. If we adopt k-fold cross-validation, the following procedures are conducted for every pair of $\theta$ and $\gamma$. First, $X_0$ is divided into k subsets. Then (k - 1) subsets are used for training and the remaining one is used for testing. The conventional KPCA is applied to the training data to obtain an eigen-feature space model, and the prototypes of the nearest neighbor classifier are obtained by projecting a certain number of training data to the eigen-feature space. Then the test data are projected to the eigen-feature space, and the recognition accuracy is calculated based on the nearest neighbor method. The above procedure is repeated for the k combinations of training and test subsets to estimate the average recognition performance. Finally, the values of $\theta$ and $\gamma$ with the highest average recognition accuracy are selected.
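The parameter search described above can be sketched as a grid search over ($\theta$, $\gamma$) pairs with k-fold cross-validation. This is only an outline under stated assumptions: the folds are formed without shuffling, and `evaluate` is a hypothetical callback standing in for "run KPCA on the training folds and return the nearest neighbor accuracy on the test fold".

```python
from itertools import product

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal folds (no shuffling)."""
    return [list(range(n))[i::k] for i in range(k)]

def grid_search(n, k, thetas, gammas, evaluate):
    """Return (best mean accuracy, theta, gamma) over all candidate pairs."""
    folds = k_fold_indices(n, k)
    best = None
    for theta, gamma in product(thetas, gammas):
        accuracies = []
        for i, test_idx in enumerate(folds):
            # All folds except the i-th form the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            accuracies.append(evaluate(train_idx, test_idx, theta, gamma))
        mean_acc = sum(accuracies) / k
        if best is None or mean_acc > best[0]:
            best = (mean_acc, theta, gamma)
    return best
```

In use, `evaluate` would wrap the KPCA training and nearest neighbor test steps; ties are resolved in favor of the first candidate pair encountered.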

After determining $\theta$ and $\gamma$, linearly independent data in a feature space are selected from $X_0$. Let the number of such independent data be m and the set of independent data be $\Phi_m = \{\phi(x_i)\}_{i=1}^{m}$ (see footnote 1). Then, the eigenvalue problem in Eq. (6) is solved to obtain the coefficient matrix $A_m$ and the eigenvalue matrix $\Lambda_m$. To determine the dimensions of an eigen-feature space, the minimum d is found such that the accumulation ratio C(d) in Eq. (7) is larger than or equal to the threshold $\theta$. After redefining the coefficient matrix $A_d$ by taking the first d column vectors of $A_m$, d linearly independent data $X_d = \{x_i\}_{i=1}^{d}$ are selected such that the following kernel matrix D, defined from the projection of $\Phi_m = \{\phi(x_i)\}_{i=1}^{m}$ to the eigen-feature space, is full rank.


$D = A_d^{T} \left[ \Phi_m^{T}(\phi(x_1) - c), \cdots, \Phi_m^{T}(\phi(x_m) - c) \right]$ (9)

where

$\Phi_m^{T} c = \frac{1}{N} \left[ \sum_{j=1}^{N} k(x_1, x_j), \cdots, \sum_{j=1}^{N} k(x_m, x_j) \right]^{T}$

1 Note that the data set $X_m = \{x_i\}_{i=1}^{m}$ is actually kept in memory instead of $\Phi_m$.

Then, the eigenvalue problem in Eq. (6) w.r.t. $\Phi_d$ is solved to recalculate the coefficient matrix $A_d$ and the eigenvalue matrix $\Lambda_d$. Since the data mean c cannot be held in an explicit form, the following two terms on c are calculated and kept for the next incremental round:

$\|c\|^2 = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} k(x_i, x_j)$ (10)

$\Phi_d^{T} c = \frac{1}{N} \left[ \sum_{j=1}^{N} k(x_1, x_j), \cdots, \sum_{j=1}^{N} k(x_d, x_j) \right]^{T}$ (11)

Let us denote the calculated eigen-feature space model by the following sextuple: $\Omega = \{X_d, A_d, \Lambda_d, \Phi_d^{T} c, \|c\|^2, N\}$.

Note that only d training data $X_d$ are kept in memory to update an eigen-feature space incrementally.
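The two terms on the mean c that are kept for the incremental phase follow directly from $c = \frac{1}{N} \sum_i \phi(x_i)$ and the kernel trick. The sketch below assumes the RBF kernel of Eq. (8) and small in-memory lists; the function names and toy data are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel of Eq. (8)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_terms(data, independent, gamma):
    """Return (||c||^2, Phi_d^T c) computed from kernel values only."""
    n = len(data)
    # ||c||^2 is the average of all pairwise kernel values.
    c_norm2 = sum(rbf_kernel(xi, xj, gamma) for xi in data for xj in data) / n ** 2
    # Each entry of Phi_d^T c averages the kernel between one kept datum and all data.
    phi_c = [sum(rbf_kernel(xd, xj, gamma) for xj in data) / n for xd in independent]
    return c_norm2, phi_c

# Toy data: N = 2 one-dimensional inputs, d = 1 kept datum.
c_norm2, phi_c = mean_terms([[0.0], [1.0]], [[0.0]], 1.0)
```

Both quantities need all N initial data once, which is affordable in the initial phase; afterwards only the stored values are updated.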


3.2  Incremental  Learning  Phase 

The pseudo-code of the main procedure in IKPCA is shown in Algorithm 2. After finishing the initial learning phase, the incremental learning is conducted whenever a new training datum x is given.

First, the numerator and the denominator of C(d) in Eq. (7) are updated as follows (see footnote 2):


i=  1 


(N  + 1)3 


(N  +  lf 


N 


vu -*!«)}’]  (i2) 


t=l 


N2  r(iV+l)2 
(TV  +  1):1  L  N 


in 

Y,  ^  +  ((f>(x)T  —  2(j)(x)Tc 

i=i 


(13) 


If $C'(d) \ge \theta$ is satisfied, the given data x is well represented by the current d-dimensional eigen-feature space. Therefore, the eigen-feature space model $\Omega$ is updated without increasing the dimensions to adapt to the new data x. This is done by solving the following eigenvalue problem:

$\frac{N}{N+1} \left( \Lambda_d + \frac{g g^{T}}{N+1} \right) R = R \Lambda'_d$ (14)


where $\Lambda'_d$ and R respectively correspond to a new eigenvalue matrix and a rotation matrix; g is given as follows:

$g = A_d^{T} \left( K_d(x) - \Phi_d^{T} c \right)$ (15)


2 In the previous work [9], we had wrong update equations of the accumulation ratio.

Algorithm 2. Incremental Kernel Principal Component Analysis (IKPCA)
Input: Initial training data $X_0 = \{x_i\}_{i=1}^{N}$.
Output: Eigen-feature space model $\Omega = \{X_d, A_d, \Lambda_d, \Phi_d^{T} c, \|c\|^2, N\}$.
1: // Initial Learning Phase
2: Perform Learn Initial Eigen-feature Space.
3: // Incremental Learning Phase
4: loop
5:   Input: Training data x.
6:   Update $C'(d)$ using Eqs. (7), (12), (13).
7:   if $C'(d) \ge \theta$ then
8:     Solve the eigenvalue problem in Eq. (14) to obtain $\Lambda'_d$ and R.
9:   else
10:     Solve the eigenvalue problem in Eq. (17) to obtain $\Lambda'_{d+1}$ and R.
11:     Add $\phi(x)$ into the independent data set: $\Phi_{d+1} \leftarrow [\Phi_d, \phi(x)]$.
12:     Calculate $f^2$ using Eq. (18).
13:     Update $C'(d+1)$ by adding $f^2$ to the numerator of $C'(d)$ in Eq. (7).
14:     Increment the eigen-feature space dimensions: $d \leftarrow d + 1$.
15:   end if
16:   Update the coefficient matrix $A_d$ using Eq. (19).
17:   Update $\|c\|^2$ and $\Phi_d^{T} c$ using Eqs. (20), (21).
18:   Increment the number of data: $N \leftarrow N + 1$.
19: end loop


On the other hand, if the accumulation ratio $C'(d)$ is smaller than $\theta$, the given data x includes a certain amount of energy in the complementary eigen-feature space. Therefore, the dimensions of the eigen-feature space should be augmented in the direction of the following residue vector h:


h  %  [<£,;  <p(x)\ 


£?=  a, )<*, 

1 


(16) 


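In order to update the eigen-feature space, the following eigenvalue problem is solved:

$\frac{N}{N+1} \left( \begin{bmatrix} \Lambda_d & 0 \\ 0^{T} & 0 \end{bmatrix} + \frac{1}{N+1} \begin{bmatrix} g g^{T} & f g \\ f g^{T} & f^{2} \end{bmatrix} \right) R = R \Lambda'_{d+1}$ (17)

where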


/  =  Ax)'  <*<)(*!  (Kd(x)  +  k(x.x)  -  (31  (0],c)j  . 

(18) 

From Eq. (16), in order to represent h, the training data $\phi(x)$ should be added to the linearly independent data set. Hence, $\Phi_d$ should be updated as follows: $\Phi_{d+1} \leftarrow [\Phi_d, \phi(x)]$. Furthermore, the accumulation ratio $C'(d+1)$ should be recalculated after adding the new eigen-axis. This can be done by adding $f^2$ to the numerator in Eq. (7).

After calculating the rotation matrix R, all the eigenvectors are rotated. The rotation of eigenvectors can be equivalently conducted by updating the coefficient matrix as follows:

$A'_{d'} = A_d R$ (19)

where $d'$ is equal to $d + 1$ if the dimensional augmentation occurs. Finally, the two terms on c are updated as follows:


/ 1|  2 


N2 

(N  +  l)2 


|j9r(^’c)  + 


k(x,x)\ 

N2  ) 


(20) 


$\Phi_d^{T} c' = \frac{1}{N+1} \left\{ N (\Phi_d^{T} c) + K_d(x) \right\}.$ (21)
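The update in Eq. (21) is a running mean, so it can be checked against a batch recomputation. The sketch below assumes the RBF kernel of Eq. (8), assumes that $K_d(x)$ denotes the vector of kernel values between the d kept data and the new input x, and covers only the case where the set of kept data does not change. All function names and toy values are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel of Eq. (8)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def batch_phi_c(data, independent, gamma):
    """Phi_d^T c computed from scratch over all data."""
    n = len(data)
    return [sum(rbf_kernel(xd, xj, gamma) for xj in data) / n for xd in independent]

def update_phi_c(phi_c, n, independent, x_new, gamma):
    """Eq. (21): update Phi_d^T c after one new datum, using only stored values."""
    k_d = [rbf_kernel(xd, x_new, gamma) for xd in independent]  # assumed K_d(x)
    return [(n * pc + kd) / (n + 1) for pc, kd in zip(phi_c, k_d)]

data = [[0.0], [1.0], [3.0]]
kept = [[0.0], [1.0]]
old = batch_phi_c(data[:2], kept, 0.5)
updated = update_phi_c(old, 2, kept, data[2], 0.5)
reference = batch_phi_c(data, kept, 0.5)
```

The incremental result agrees with the batch value up to rounding, without revisiting the earlier data.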


4  Performance  Evaluation 

4.1  Experimental  Setup 

In this section, we evaluate how well the proposed IKPCA works as an online feature extraction method. For this purpose, we adopt the following two performance measures: (1) learning time and (2) learning accuracy of eigenspaces. The learning time is defined as the time to finish learning a sequence of all training data. The learning accuracy is evaluated using the average direction cosine between two corresponding eigenvectors. Ideally, the eigenvectors of the proposed IKPCA are equivalent to those of KPCA in which all the training data are trained simultaneously in a batch. Therefore, the direction cosine between the two corresponding eigenvectors of IKPCA and KPCA is evaluated to check whether the eigenspaces are identical.
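The direction cosine used as the accuracy measure can be illustrated with explicit vectors. This is a simplification: in a kernel feature space the eigenvectors are not available explicitly and the inner products would be computed through kernel values. Taking the absolute value is also an assumption made here, since an eigenvector is only defined up to sign.

```python
import math

def direction_cosine(u, v):
    """Absolute cosine of the angle between two vectors (1.0 means same axis)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)

orthogonal = direction_cosine([1.0, 0.0], [0.0, 1.0])
same_axis = direction_cosine([1.0, 2.0], [-2.0, -4.0])
```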

Six  data  sets  are  selected  from  the  UCI  Machine  Learning  Repository  [11]. 
The  information  on  these  data  sets  is  shown  in  Table  1.  For  the  Vowel-context, 
Adult,  and  Advertisement  data  sets,  we  randomly  select  up  to  1,000  data  from 
the original training data. For the other data sets, training and test data are not separated; thus, we randomly select 1,000 data from the whole data.

At  the  initial  learning  stage,  10%  of  training  data  are  randomly  selected 
as  initial  training  data.  The  remaining  90%  are  given  to  IKPCA  one  by  one. 


Table 1. Evaluated data sets

               # Attributes  # Classes  # Training Data
Vowel-context  10            10         528
Adult          14            2          1,000
Segmentation   19            7          1,000
Landsat        30            7          1,000
Ozone          72            2          1,000
Advertisement  1,558         2          1,000

Table 2. Averages and standard deviations of learning time (sec.). The results for KPCA show the time to learn all data in a batch.

               KPCA         CS-IKPCA    IKPCA
Vowel-context  5.7 ± 2.4    80.1 ± 3.2  0.33 ± 0.18
Adult          28.2 ± 11.1  339 ± 25    2.5 ± 10.2
Segmentation   21.8 ± 4.3   622 ± 43    2.1 ± 3.3
Landsat        225 ± 218    259 ± 18    1.6 ± 2.5
Ozone          392 ± 108    697 ± 34    4.9 ± 16.7
Advertisement  265 ± 116    766 ± 11    37.2 ± 51.3

Since the performance of incremental learning generally depends on the sequence of training data, the experiments are carried out for 50 different training sequences to evaluate the average performance. The parameters $\theta$ and $\gamma$ in IKPCA are determined by performing 5-fold cross-validation, and they are selected from the following candidate sets: $\theta$ = {80, 85, 90, 95, 99} [%], $\gamma$ = {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001}.

4.2  Learning  Time 

Table 2 shows the results of the learning time (sec.) for KPCA, Chin & Suter's IKPCA (CS-IKPCA) [8], and the proposed IKPCA. For CS-IKPCA, the number of eigenvectors r and the number of preimages p should be determined in advance. Here, r is set to the average dimensions of the eigen-feature spaces obtained by KPCA, and p is set to 10 according to the suggestion in [8]. Moreover, the incremental learning in CS-IKPCA is conducted for every 30 training data because it takes a long time to finish learning if training data are given one by one. For KPCA, all the training data are given in a batch to compute eigen-feature vectors.

As shown in Table 2, the learning time of IKPCA is much shorter than that of KPCA, although the learning of KPCA is conducted in a batch mode (i.e., the eigenvalue problem is solved only once). This result also suggests that it is almost infeasible to use KPCA for incremental learning. In addition, the proposed IKPCA is also much faster than CS-IKPCA even though the number of times eigenvalue problems are solved in CS-IKPCA is almost 1/30 of that in IKPCA.

From the above results, we can conclude that the proposed IKPCA can learn very fast under incremental learning settings.


4.3  Learning  Accuracy  of  Eigenspace 


The learning accuracy is measured based on the similarities (direction cosines) between two eigenvectors and the following normalized errors:

$\frac{|\lambda_i^{bat} - \lambda_i^{inc}|}{\sum_{j} \lambda_j^{bat}}$ (22)
Table 3. Accuracies of eigen-feature subspaces obtained by IKPCA against KPCA: (a) similarities (direction cosines) of eigenvectors and (b) normalized errors of eigenvalues. The values with bold face fonts correspond to the principal components whose eigenvalues are larger than 5% of the sum of all eigenvalues. The results for the first 10 principal components are shown.

(a)
               1      2      3      4      5      6      7      8      9      10
Vowel-context  0.9    0.97   0.96   0.96   0.93   0.90   0.94   0.95   0.96   0.92
Adult          1.00   0.93   0.90   0.87   0.87   0.83   0.80   0.78   0.73   0.74
Segmentation   0.99   0.99   0.95   0.94   0.94   0.95   0.91   0.92   0.90   0.89
Landsat        0.99   0.97   0.96   0.87   0.89   0.88   0.92   0.90   0.89   0.83
Ozone          0.99   0.98   0.97   0.94   0.90   0.84   0.85   0.85   0.85   0.83
Advertisement  0.99   0.98   0.96   0.95   0.94   0.94   0.92   0.88   0.87   0.89

(b)
               1      2      3      4      5       6      7      8      9      10
Vowel-context  0.018  0.013  0.016  0.009  0.006   0.005  0.003  0.002  0.001  0.002
Adult          0.024  0.009  0.008  0.007  0.0061  0.007  0.006  0.007  0.007  0.008
Segmentation   0.008  0.003  0.006  0.003  0.003   0.002  0.002  0.002  0.002  0.001
Landsat        0.013  0.018  0.006  0.004  0.002   0.002  0.002  0.002  0.001  0.001
Ozone          0.014  0.012  0.008  0.006  0.006   0.006  0.005  0.004  0.004  0.004
Advertisement  0.012  0.009  0.007  0.007  0.006   0.004  0.001  0.004  0.004  0.004

where $\lambda_i^{\mathrm{bat}}$ and $\lambda_i^{\mathrm{inc}}$ are the $i$th
eigenvalues calculated by KPCA and IKPCA, respectively.

Tables 3 (a) and (b) show the similarities of eigenvectors and the normalized
errors of eigenvalues for the first 10 principal components, respectively. Here,
the eigenvectors whose eigenvalue is larger than 5% of the sum of all eigenvalues
are called major components, and the results for the major components are shown
in a bold font. From Tables 3 (a) and (b), the major components are well
approximated with high similarities (over 0.9) except for the Adult data, and the
normalized errors are less than 2.5%. From these results, we conclude that the
proposed IKPCA can approximate eigenspaces with good accuracy.
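
The two accuracy measures used in this section can be sketched in a few lines. This is only an illustrative sketch, not the authors' evaluation code: it assumes the eigenvectors are available as explicit column vectors (in kernel PCA they would normally be expressed through kernel expansions), that the sign of an eigenvector is irrelevant, and that the eigenvalue error is normalized by the sum of the batch eigenvalues. The function names are hypothetical.

```python
import numpy as np

def direction_cosines(v_bat, v_inc):
    """Absolute cosine between matching columns of two eigenvector matrices."""
    v_bat = np.asarray(v_bat, dtype=float)
    v_inc = np.asarray(v_inc, dtype=float)
    dots = np.abs(np.sum(v_bat * v_inc, axis=0))
    norms = np.linalg.norm(v_bat, axis=0) * np.linalg.norm(v_inc, axis=0)
    return dots / norms

def normalized_eigenvalue_errors(lam_bat, lam_inc):
    """|lam_bat - lam_inc| divided by the sum of batch eigenvalues (assumed)."""
    lam_bat = np.asarray(lam_bat, dtype=float)
    lam_inc = np.asarray(lam_inc, dtype=float)
    return np.abs(lam_bat - lam_inc) / lam_bat.sum()

def major_components(lam_bat, ratio=0.05):
    """Boolean mask of components whose eigenvalue exceeds 5% of the total."""
    lam_bat = np.asarray(lam_bat, dtype=float)
    return lam_bat > ratio * lam_bat.sum()
```

A similarity close to 1 together with a small normalized error for every component flagged by `major_components` corresponds to the "good accuracy" reading of Table 3.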


5  Conclusions 

In this paper, we fix some mistakes in the derivation of Takeuchi et al.'s
IKPCA [9], which made the eigenspace learning somewhat unstable. In addition, we
extend the IKPCA algorithm such that parameters are automatically optimized
for initial training data using a cross-validation method. The proposed IKPCA
consists of two learning phases: an initial learning phase and an incremental
learning phase. In the former phase, the threshold of the accumulation ratio and
the kernel parameter are optimized, and then an initial eigen-feature space is
computed by applying the conventional KPCA. In the latter phase, the eigen-feature
space is incrementally updated whenever new training data are given.

The proposed IKPCA learns a high-dimensional feature space incrementally
by solving an eigenvalue problem whose matrix size is given by the power of
the number of independent data. Since independent data are selected in a
low-dimensional eigen-feature space spanned by eigenvectors, the matrix size in
the eigenvalue problem is generally small, and this allows IKPCA to learn an
eigen-feature space very fast even though the eigenvalue decomposition has to be
carried out at every learning stage.

To verify the effectiveness of the proposed IKPCA, the learning time and the
accuracies of eigenvectors and eigenvalues are evaluated for the six benchmark
data sets in the UCI machine learning repository. The experimental results show
that the proposed IKPCA can learn an eigen-feature space very fast compared
with Chin and Suter's IKPCA, and accurate eigenvectors and eigenvalues are
obtained, especially for major components.

Abstract. Under extreme light conditions, a conventional colour CCD
camera would fail to render the colours of an object properly, as the visible
spectrum is either faintly observable in the scene or the presence of glare
corrupts the colours sensed. On the other hand, for darkly illuminated
areas, a near-infrared (NIR) camera would sense stronger, more
discriminable signals, but could only render the scene monochromatically. The
underlying challenge in this research is how to adaptively integrate a
monochromatic NIR image with a faintly rendered colour image of the
same darkly or very brightly lit scene to give rise to improved colour
classification results that discriminate colours more effectively. This
research proposes a Fuzzy-Genetic colour processing algorithm that
adaptively marries together the visible and near-infrared spectra signals for
the purpose of colour object recognition. The experiments were done on
a scene with spatially varying illumination intensities, using Fujifilm's
UV/IR Super CCD camera with a sensitivity range of 380 nm to
1000 nm in conjunction with NIR filters. Results show that the proposed
multi-spectrum technique yields better colour classification results than
utilizing the pure visible spectrum alone.


1  Introduction 

There is a breaking point for colour classification techniques operating within
the limits of the visible spectrum. For very dark exploratory regions, mostly the
longer wavelengths of light are present in the scene, while the others fade
significantly. On the other hand, the presence of glare causes the pixel colours
to approach pure white. In the electromagnetic spectrum, there is a region that
corresponds to the non-visible, infrared spectrum (0.7-2.4 micrometres) [1] that is
not yet fully explored for colour classification. It can be deduced that cultivating
these infrared signals and integrating them with the colour sensed values in the
visible spectrum will expand the colour discrimination capabilities of computer
vision systems. However, the integration of the signals is far from trivial and
also requires that the fused colours be discriminable despite the presence of
gradients in the illumination intensities. In addition, similarly coloured objects
(e.g. orange, red, pink, violet) should be distinguishable from each other,
regardless of their position in the exploratory field (i.e. dim, dark or bright
illumination setting). Therefore, the ultimate goal of the fusion of visible and
infrared signals is to allow for adaptive colour correction to improve colour
classification under spatially varying illumination intensities. Due to the
limitations of the camera used, the scope of this work only explores the
integration of the visible (480-700 nm) and near infrared spectra (700-900 nm).

Innate in the human visual system is our ability to compensate for the effects
of illumination changes, allowing us to perceive the colours of objects more stably.
This capability is known as colour constancy [2], and is a desirable feature for
any colour object recognition system. There have been many attempts to mimic this
capability computationally, but most colour constancy algorithms operate with
great efficacy only on scenes with uniform illumination conditions [2]. In general,
colour constancy algorithms aim to keep constant the computed colour of an
image pixel irrespective of the illumination present in the scene [2]. In
contrast, the proposed algorithms in this paper aim to keep constant the position
(i.e. Cartesian coordinates) of the computed colour of an image pixel in the
colour space, within the confines of a pie-slice decision region assigned to it for its
classification. Within a scene, the proposed algorithm performs colour correction
only on the candidate pixels depicting the target colours to be tracked down;
the rest of the colours in the image remain unscathed. We call this technique
selective colour constancy. The colour corrections are employed not to improve
the appearance of the colours per se, but with the aim of classifying the target
colours more accurately. Multispectral selective colour constancy in this research
is achieved by means of a Fuzzy-Genetic colour contrast fusion that adaptively
enhances or degrades the colour tristimulus, thereby influencing the formation
of colours depicting the target object within a pie-slice decision region in the
rg-chromaticity colour space.


2  Related  Works 

The use of multispectral imaging has come of age to be a viable alternative to
conventional broadband monochromatic or colour imaging cameras for a multitude
of imaging applications: face recognition with different poses and expressions
[3], geographical studies [4,5], food processing industry [6,7,8,9,10] and
medical imaging [11,12].

Multispectral imaging captures a wide range of light reflectance and thermal
radiation information spanning both the visible and near infrared (non-visible)
spectra. The general technique employed in multi-spectral imaging requires a
set of images, each acquired at a narrow band of wavelengths. Using a UV/IR
camera with a bandpass filter (or interference filter) in front of it, images could
be obtained at discrete spectral regions [8]. Vilaseca et al. [13] introduced
multiple pseudo colour schemes in NIR to colourise these discrete spectral regions.
On the other hand, some studies employed the whole spectral range as input.
Menesatti et al. [14] used a monochromatic spectrophotometer over the VIS to NIR
spectrum range to analyse plant nutritional status. Mertens et al. [15] also used
a combination of the VIS to NIR spectrum range to analyse egg shell colours for
quality measurement using the L*a*b* colour space. In contrast, Pap and Ziljak
studied the separation of the near infrared wavelength area in the case of a double
image reproduction [16].

All of the aforementioned undertakings capitalize on the visible and near
infrared spectra integration to extract useful patterns or signatures of objects;
they do not, however, revive the colours of the objects. We know of no study
that has tried to combine both complementary spectral ranges into one
colour scheme for improved colour classification. What makes this research
unique is that we propose an adaptive Fuzzy-Genetic-based visible and near
infrared spectra integration technique with adaptive fuzzy colour enhancement
and degradation operators that revive colours in very low light conditions. The
Genetic Algorithm component of the system automatically fine-tunes all
parameters required by the colour classifiers. Once calibration is completed, colour
classification is performed in real-time using a novel variable-depth colour
look-up table [17,18].

3  Illustration  of  the  Problem  Domain 

There is a problem in colour object classification when the colours of an object
become indistinguishable due to very strong or very weak illumination
conditions, and also due to the limits of the sensitivity range of the colour CCD
camera. In this case, it is extremely difficult, if not impossible, to estimate the
real colours of the object merely from the colour information captured from the
visible spectrum. However, for multispectrum cameras, it is still possible to
extract further information from the same pixel location in the near-infrared range.
Figure 1 shows an example of a scene with a very low light condition and with
illumination gradients. On the other hand, Fig. 2 shows a near-infrared image
of the same scene.

4  Experiment  Set-Up 

A Fujifilm IS Pro UV/IR camera is used for capturing all the multi-spectral
images. The camera's sensitivity ranges from 380 nm to 1000 nm, covering the
ultraviolet and near-infrared spectra. Eight different filters were used to
control the transmission of light: Peca 902(#70), 904(#87), 906(#87a), 908(#87b),
910(#87c), 912(#88a), 914(#89b) and 918 (visible spectrum). The numbers in
parentheses correspond to the Wratten optical filter labels. The exposure
times used are 200, 167, 125, 100, 77, 67, 50 and 40 milliseconds; this simulates
different ambient lighting conditions. The light source is a standard halogen
spotlight with a 50-Watt capacity. Seven target colour patches, each with
4 representatives, were classified. These colour patches were strategically placed
in varying illumination intensities to test for relatively bright, dim and dark
illumination conditions.

Fig. 1. An example of a scene with spatially varying illumination intensities reflecting
the visible spectrum. The image on the right is an enlargement of the upper right
corner section of the same scene. Fujifilm IS Pro, F/3.5, 1/20 sec, ISO 100, Peca 918
Filter.


Fig. 2. The corresponding image of the scene shown in Fig. 1, reflecting only the near
infrared spectrum. The image on the right is an enlargement of the upper right corner
section of the same scene. Fujifilm IS Pro, F/3.5, 1/20 sec, ISO 100, Peca 904 Filter.


5  The  Algorithms 

5.1  Colour  Space  and  the  Pie-Slice  Decision  Region 

The fused visible and NIR signals are scaled to be representable in a modified
rg-chromaticity colour space, where the colour descriptors used [19] are suitable
for pie-slice colour classification [20]: rg-Hue corresponds to the angle, while
rg-Saturation corresponds to the radius of a colour pixel.

rg-chromaticities: $r = \frac{R}{R+G+B}$, $\quad g = \frac{G}{R+G+B}$

$\text{rg-Saturation} = \sqrt{(r - 0.333)^2 + (g - 0.333)^2}$, $\quad \text{rg-Hue} = \tan^{-1}\!\left(\frac{g - 0.333}{r - 0.333}\right)$
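
As an illustration of how these descriptors feed a pie-slice decision region, the following sketch computes rg-Hue and rg-Saturation for one pixel and tests it against an angular and radial window. It is a minimal sketch under stated assumptions, not the authors' implementation: `atan2` is used so that the hue covers the full 0-360 degree range, a black pixel is treated as achromatic, and the function names and the wrap-around handling of the angle window are hypothetical.

```python
import math

WHITE_POINT = 0.333  # centre of the rg-chromaticity plane used in the text

def rg_descriptors(R, G, B):
    """Return (r, g, hue_degrees, saturation) for one RGB pixel."""
    total = R + G + B
    if total == 0:
        # Assumption: a black pixel carries no chromatic information.
        r = g = WHITE_POINT
    else:
        r = R / total
        g = G / total
    saturation = math.hypot(r - WHITE_POINT, g - WHITE_POINT)
    hue = math.degrees(math.atan2(g - WHITE_POINT, r - WHITE_POINT)) % 360.0
    return r, g, hue, saturation

def in_pie_slice(hue, saturation, min_angle, max_angle, min_radius, max_radius):
    """True if (hue, saturation) lies inside the pie-slice decision region."""
    if min_angle <= max_angle:
        angle_ok = min_angle <= hue <= max_angle
    else:
        # The slice straddles the 0/360 degree boundary.
        angle_ok = hue >= min_angle or hue <= max_angle
    return angle_ok and min_radius <= saturation <= max_radius
```

For example, a saturated red pixel `(255, 0, 0)` lands at a hue of roughly 333 degrees, so a slice from 300 to 20 degrees would need the wrap-around branch to capture it.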

5.2  Fuzzy  Colour  Contrast  Fusion  (FCCF)

The resulting geometric shape of the distribution of the fused colour pixel values
depicting the target colour objects is not readily amenable to colour
classification using a pie-slice decision region in the rg-chromaticity space. The
drifting of the colour pixel values in the colour space is highly non-linear due to
the effects of spatially varying illumination intensities, and this is compensated
for by a fuzzy colour processing algorithm called FCCF [19], in combination with a
Heuristically Assisted Genetic Algorithm (HAGA) [21] for automatically extracting
the parameters of the colour classifiers.

The inputs to FCCF are the combined visible and NIR colour tristimulus in
RGB form, as well as the calculated rg-Hue and rg-Saturation values. HAGA
instructs FCCF how to operate on the raw input colours by feeding it with the
evolved colour classifier parameters. The parameters mainly consist of the set
of optimal colour contrast rules for both the visible and near infrared channels,
the colour contrast enhance (1) and degrade (2) operations, and the colour
contrast constraint angles for the fused visible and NIR channels. Consequently,
FCCF returns the refined RGB values amenable to final colour classification.
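Contrast Enhance Operator:

$$\mu_A'(y) = \begin{cases} 2\,[\mu_A(y)]^2, & 0 \le \mu_A(y) < 0.5 \\ 1 - 2\,[1 - \mu_A(y)]^2, & 0.5 \le \mu_A(y) \le 1 \end{cases} \qquad (1)$$

Contrast Degrade Operator:

$$\mu_A'(y) = \begin{cases} 0.5 + 2\,[\mu_A(y) - 0.5]^2, & 0 \le \mu_A(y) < 0.5 \\ 0.5 - 2\,(1 - [\mu_A(y) + 0.5]^2), & 0.5 \le \mu_A(y) \le 1 \end{cases} \qquad (2)$$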
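
The contrast enhance operator has the form of the classical fuzzy intensification operator, which can be sketched as below. This is a hedged illustration rather than the FCCF implementation: it assumes membership values normalised to [0, 1], covers only the enhance operator of Eqn. (1), and the function name is hypothetical.

```python
def contrast_enhance(mu):
    """Fuzzy intensification: pushes a membership value away from 0.5."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("membership value must lie in [0, 1]")
    if mu < 0.5:
        return 2.0 * mu * mu
    return 1.0 - 2.0 * (1.0 - mu) ** 2

if __name__ == "__main__":
    for mu in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(mu, contrast_enhance(mu))
```

Values below 0.5 are driven towards 0 and values above 0.5 towards 1, while 0, 0.5 and 1 are fixed points, which is why the operator increases contrast in a colour channel.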


5.3  VIS-NIR  Fusion  Operators 

The fuzzy colour contrasted near-infrared signal is fused adaptively with the
visible colour tristimulus according to the fusion operation range. The candidate
fusion operators are listed in Equation 3, where $\alpha$ is one of the colour
components of the visible spectrum (e.g. R, G or B) and $\beta$ is the fuzzy colour
contrasted value of the NIR signal. For each colour channel (i.e. R,G,B), the fusion
operation, the fusion operation range and the colour contrast operator for the NIR
signal are selected automatically by the HAGA algorithm.

(a) $\alpha = \alpha\,(1 + \beta)$, (b) $\alpha = \alpha\,\beta$, (c) $\alpha = \alpha - \beta$,
(d) a = — y - , (e) $\alpha = \alpha + \beta$, (f) $\alpha = \beta$    (3)

$\alpha = 1$ if $\alpha > 1$; $\quad \alpha = |\alpha|$ if $\alpha < 0$
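
A minimal sketch of the candidate fusion operators and the clamping rule follows. It assumes that both the visible component and the contrasted NIR value are normalised to [0, 1]; operator (d) is omitted because its expression is not legible in the source text, and the dictionary keys and function names are hypothetical.

```python
def clamp_fused(alpha):
    """Post-fusion rule: values above 1 saturate, negatives are mirrored."""
    if alpha > 1.0:
        return 1.0
    if alpha < 0.0:
        return abs(alpha)
    return alpha

# Candidate operators from Equation 3; alpha is the visible component and
# beta the fuzzy colour contrasted NIR value. Operator (d) is left out.
FUSION_OPS = {
    "a": lambda alpha, beta: alpha * (1.0 + beta),
    "b": lambda alpha, beta: alpha * beta,
    "c": lambda alpha, beta: alpha - beta,
    "e": lambda alpha, beta: alpha + beta,
    "f": lambda alpha, beta: beta,
}

def fuse(alpha, beta, op):
    """Apply one fusion operator and clamp the result."""
    return clamp_fused(FUSION_OPS[op](alpha, beta))
```

In the described system the operator key would be chosen per colour channel by the genetic algorithm; here it is simply passed in by the caller.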


Colour  Object  Classification  Using  the  Fusion  of  Visible  and  NIR  Spectra 




6  General  System  Architecture 

The proposed system extends the colour classification system described in [21] to
operate both in the visible and near infrared spectra. As depicted in Fig. 3, there
are now three input streams: the colour tristimulus from the visible spectrum
(i.e. R,G,B), the monochromatic near infrared signal and the colour classifier (i.e.
multi-spectrum FCCF-VCD classifier). Initially, the colour tristimulus values are
reduced to a lower colour resolution according to the Variable Colour Depth
Processing component. Next, each processed colour channel value is tested against
the corresponding fusion operation range produced by the colour classifier. If the
processed colour value falls within the fusion operation range, then this signal
is fused together with the fuzzy contrasted near infrared signal. Afterwards, the
fused visible and near infrared signals are processed similarly as in [21].
Basically, the fused signals will be fuzzy enhanced or degraded adaptively according
to their location in the pie-slice decision region.


Fig.  3.  Multi-spectrum  Colour  Processing  Architecture 

6.1  HAGA  Chromosome  Design  for  the  Multi-spectrum  Colour
Classifier

The chromosome generally encodes the pie-slice colour classifier parameters in
the modified rg-chromaticity space, the fuzzy colour contrast rules, colour
contrast constraints and the visible and NIR spectra fusion operations. The
chromosome design is an extension of the pure visible colour classifier described
in [21]. The schematic of the chromosome is shown in Fig. 4.

Parameter                       Range           Length     Incremental Steps
Min Angle                       0° - 360°       10 bits    0.351
Max Angle                       0° - 360°       10 bits    0.351
Min Radius                      0 - 1           10 bits    0.001
Max Radius                      0 - 1           10 bits    0.001
Min Contrast Angle              0° - 360°       10 bits    0.351
Max Contrast Angle              0° - 360°       10 bits    0.351

Parameter                       Range           Length     Incremental Steps
Red Contrast Rule               -3.00 - 3.99    6 bits     0.109
Green Contrast Rule             -3.00 - 3.99    6 bits     0.109
Blue Contrast Rule              -3.00 - 3.99    6 bits     0.109
Red Colour Depth                5 - 8.99        4 bits     0.249
Green Colour Depth              5 - 8.99        4 bits     0.249
Blue Colour Depth               5 - 8.99        4 bits     0.249

Parameter                       Range           Length     Incremental Steps
Red Fusion Operation            0 - 8.99        5 bits     0.281
Green Fusion Operation          0 - 8.99        5 bits     0.281
Blue Fusion Operation           0 - 8.99        5 bits     0.281
Red Fusion Operation Range      -1 - 1          7 bits     0.016
Green Fusion Operation Range    -1 - 1          7 bits     0.016
Blue Fusion Operation Range     -1 - 1          7 bits     0.016
Red Contrast Rule               -3.00 - 3.99    6 bits     0.109
Green Contrast Rule             -3.00 - 3.99    6 bits     0.109
Blue Contrast Rule              -3.00 - 3.99    6 bits     0.109

Fig.  4.  Chromosome  Design 
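
The incremental steps listed in Fig. 4 appear to equal the parameter range divided by the number of values a gene of the given length can encode (for example, 360 / 2^10 is approximately 0.351). The sketch below decodes one gene on that assumption. It is an illustrative reading of the table, not the published HAGA encoding, and the function names are hypothetical.

```python
def step_size(lo, hi, bits):
    """Increment between neighbouring gene values, assuming range / 2**bits."""
    return (hi - lo) / (2 ** bits)

def decode_gene(bit_string, lo, hi):
    """Map a binary gene such as '0101100101' to a real-valued parameter."""
    return lo + int(bit_string, 2) * step_size(lo, hi, len(bit_string))

if __name__ == "__main__":
    # Angle parameters: 10 bits over 0-360 degrees.
    print(step_size(0.0, 360.0, 10))
    # Contrast rule parameters: 6 bits over -3.00 to 3.99.
    print(step_size(-3.0, 3.99, 6))
    print(decode_gene("1111111111", 0.0, 360.0))
```

Under this reading, an all-ones 10-bit angle gene decodes to just under 360 degrees, so the upper end of each range is never reached exactly.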


6.2  Fitness  Function 


The evolved colour classifiers represented by the chromosomes are automatically
graded using a fitness function described in [21]. The fitness function (Eqn. (4))
adaptively forgives false positive classifications to encourage finding classifiers
that return high true positives. On the other hand, it tries to avoid getting
trapped in local maxima by reducing rewards in cases where true positives and
false positives are both very low.


true  positive  pixels  in  target  area 
total  pixels  in  target  area 
false  positive  pixels  in  outside  target  area 
total  pixels  outside  target  area 

fitness  =  1 


1 


1 


-HKy-o.s) 


+ 


1  F4.f.-75(*~0.05) 

1  f  “  100;  0.4) 


(4) 


7  Experiment  Results  and  Analysis 

Fig. 5 illustrates the colour classification results using the pure visible spectrum
and the fused visible and NIR spectra. As can be seen from the pure visible spectrum
Image (a), pink and violet are hardly distinguishable from each other. On the
other hand, by utilising the NIR Image (b) for additional colour information
and applying fuzzy colour contrast fusion and colour classifier optimisation by
HAGA, the resulting Image (c) changed dramatically in colour compared to
the original one.

What is interesting to see in the results is the revival of the colours of the two
Pink colour patches in the centre of the image. This is reflected by the results
found in Image (d) - fused visible and NIR classification results, as well as Image

Fig. 5. Sample colour classification results for the Pink target colour patches. The *
labels identify the Pink targets and the + labels identify the Violet targets in image (a).
(a) original image under 200 ms exposure time, using Peca 918 filter, (b) near-infrared
image under 200 ms exposure time using Peca 908 filter, (c) fused and fuzzy colour
contrasted image using images (a) and (b) as inputs, (d) colour classification results
using the image in (c), red pixels depict true positive results while yellow depict false
positives, (e) fuzzy colour contrasted image of the scene in (a), (f) colour classification
results using the image in (e), red pixels depict true positive results while yellow depict
false positives.


Table  1.  Colour  Classification  Result:  Visible  Versus  Fusion  of  Visible  and  NIR  Spectra 
(Green) 


Shutter      Filter type (fusion of visible and near infrared spectra)           Fusion    Best      Percentage of
speed (ms)   VIS     902     904     906     908     910     912     914       best      filter    improvement
                                                                                result              over the visible
200          0.958   0.919   0.803   0.939   0.937   0.956   0.919   0.949     0.956     910       -0.261%
167          0.963   0.901   0.908   0.915   0.914   0.959   0.962   0.925     0.962     912       -0.127%
125          0.965   0.803   0.901   0.750   0.902   0.914   0.735   0.875     0.964     inn       -0.076%
100          0.966   0.910   0.921   0.955   0.889   0.958   0.949   0.964     0.964     914       -0.216%
77           0.968   0.898   0.913   0.912   0.925   0.952   0.916   0.932     0.952     910       -1.618%
67           0.967   0.914   0.932   0.920   0.738   0.956   0.938   0.948     0.956     910       -1.155%
50           0.967   0.911   0.933   0.930   0.935   0.919   0.937   0.941     0.914     902       -2.179%
40           0.961   0.734   0.885   0.871   0.953   0.922   0.895   0.872     0.953     908       -0.771%
Average      0.964   0.891   0.907   0.904   0.899   0.942   0.907   0.926     0.956     910       -0.837%

Table 2. Colour Classification Result: Visible Versus Fusion of Visible and NIR Spectra
(Light Blue)

Shutter      Filter type (fusion of visible and near infrared spectra)           Fusion    Best      Percentage of
speed (ms)   VIS     902     904     906     908     910     912     914       best      filter    improvement
                                                                                result              over the visible
200          0.971   0.969   0.967   0.970   0.972   0.967   0.968   0.968     0.972     908       0.150%
167          0.975   0.970   0.973   0.969   0.977   0.975   0.974   0.972     0.977     908       0.249%
125          0.977   0.973   0.976   0.977   0.975   0.977   0.977   0.975     0.977     912       0.014%
100          0.978   0.977   0.977   0.977   0.977   0.978   0.979   0.974     0.979     912       0.085%
77           0.978   0.973   0.978   0.970   0.976   0.977   0.976   0.973     0.978     904       0.037%
67           0.980   0.979   0.980   0.981   0.978   0.979   0.977   0.978     0.981     906       0.114%
50           0.980   0.979   0.980   0.979   0.980   0.979   0.979   0.978     0.980     908       -0.003%
40           0.971   0.967   0.971   0.964   0.970   0.970   0.971   0.969     0.971     904       -0.009%
Average      0.976   0.973   0.975   0.974   0.976   0.975   0.975   0.974     0.977     908       0.080%


Table  3.  Colour  Classification  Result:  Visible  Versus  Fusion  of  Visible  and  NIR  Spectra 
(Orange) 


Shutter      Filter type (fusion of visible and near infrared spectra)           Fusion    Best      Percentage of
speed (ms)   VIS     902     904     906     908     910     912     914       best      filter    improvement
                                                                                result              over the visible
200          0.708   0.497   0.497   0.497   0.850   0.497   0.558   0.497     0.850     908       16.692%
167          0.941   0.497   0.954   0.497   0.497   0.497   0.734   0.497     0.954     904       1.372%
125          0.946   0.682   0.497   0.497   0.716   0.497   0.497   0.608     0.716     908       -32.089%
100          0.939   0.881   0.497   0.561   0.497   0.497   0.497   0.836     0.881     902       -6.670%
77           0.917   0.497   0.794   0.642   0.823   0.497   0.497   0.497     0.823     908       -11.108%
67           0.797   0.497   0.497   0.497   0.497   0.497   0.497   0.497     0.497     902       -60.583%
50           0.618   0.497   0.497   0.497   0.497   0.497   0.497   0.497     0.497     902       -24.513%
40           0.496   0.496   0.496   0.496   0.496   0.496   0.496   0.496     0.496     914       0.011%
Average      0.795   0.568   0.591   0.523   0.609   0.497   0.534   0.553     0.714     908       -11.378%

Table 4. Colour Classification Result: Visible Versus Fusion of Visible and NIR Spectra
(Pink)

Shutter      Filter type (fusion of visible and near infrared spectra)           Fusion    Best      Percentage of
speed (ms)   VIS     902     904     906     908     910     912     914       best      filter    improvement
                                                                                result              over the visible
200          0.701   0.673   0.861   0.641   0.931   0.740   0.781   0.933     0.933     914       21.589%
167          0.869   0.917   0.850   0.937   0.943   0.763   0.851   0.949     0.949     914       8.397%
125          0.948   0.938   0.903   0.909   0.939   0.891   0.906   0.877     0.939     908       -1.033%
100          0.927   0.921   0.920   0.959   0.939   0.896   0.933   0.927     0.959     906       3.339%
77           0.907   0.928   0.937   0.917   0.935   0.952   0.928   0.923     0.952     910       4.795%
67           0.948   0.893   0.895   0.910   0.932   0.957   0.922   0.920     0.957     910       0.850%
50           0.930   0.911   0.895   0.943   0.911   0.935   0.898   0.945     0.945     914       1.571%
40           0.810   0.759   0.847   0.878   0.790   0.839   0.800   0.852     0.878     906       •1.280%
Average      0.881   0.872   0.888   0.891   0.915   0.872   0.877   0.916     0.940     914       3.411%

(f) - visible spectrum classification results. It is evident that the true positives
increased in Image (d) after the fusion of visible and NIR signals with FCCF
and HAGA operations.

The experiments involved training the colour classifier using FCCF-HAGA-VCD,
which takes inputs from the visible and NIR images at 8 different shutter
speeds. The visible and NIR images show that the colour objects are under
spatially varying illumination conditions. Seven different filters were used for taking

Table 5. Colour Classification Result: Visible Versus Fusion of Visible and NIR Spectra
(Violet)

Shutter      Filter type (fusion of visible and near infrared spectra)           Fusion    Best      Percentage of
speed (ms)   VIS     902     904     906     908     910     912     914       best      filter    improvement
                                                                                result              over the visible
200          0.886   0.935   0.868   0.938   0.895   0.964   0.880   0.918     0.964     910       8.153%
167          0.883   0.910   0.947   0.950   0.882   0.877   0.896   0.950     0.950     914       7.062%
125          0.927   0.939   0.853   0.930   0.839   0.922   0.921   0.951     0.954     914       2.880%
100          0.935   0.950   0.931   0.937   0.911   0.950   0.937   0.929     0.950     902       1.562%
77           0.973   0.932   0.965   0.900   0.900   0.955   0.917   0.912     0.965     904       -0.815%
67           0.901   0.943   0.945   0.919   0.938   0.910   0.900   0.937     0.945     904       4.598%
50           0.916   0.928   0.933   0.929   0.9HU   0.931   0.935   0.919     0.935     912       2.009%
40           0.889   0.851   0.87)   0.900   0.880   0.851   0.918   0.892     0.918     912       3.190%
Average      0.914   0.923   0.912   0.934   0.!M)8  0.921   0.914   0.931     0.948     <MM»      3.073%

Table 6. Colour Classification Result: Visible Versus Fusion of Visible and NIR Spectra
(Red)

Shutter      Filter type (fusion of visible and near infrared spectra)           Fusion    Best      Percentage of
speed (ms)   VIS     902     904     906     908     910     912     914       best      filter    improvement
                                                                                result              over the visible
200          0.939   0.932   0.933   0.933   0.930   0.942   0.933   0.933     0.942     910       0.277%
167          0.956   0.934   0.920   0.928   0.931   0.933   0.933   0.939     0.939     914       -1.832%
125          0.968   0.953   0.924   0.959   0.959   0.960   0.940   0.922     0.960     910       -0.802%
100          0.963   0.900   0.951   0.904   0.938   0.962   0.911   0.944     0.962     910       -0.113%
77           0.906   0.900   0.918   0.870   0.948   0.859   0.894   0.890     0.900     902       0.051%
67           0.949   0.892   0.751   0.795   0.747   0.949   0.948   0.917     0.949     910       -0.007%
50           0.921   0.911   0.839   0.911   0.881   0.818   0.710   0.707     0.911     902       ~ -n.7m
40           0.822   0.000   0.190   0.014   0.080   0.810   0.082   0.810     0.810     910       -0.696%
Average      0.930   0.902   0.KI6   0.864   0.878   0.909   0 >74   0.883     0.931     910       -0.180%

Table  7.  Colour  Classification  Result:  Visible  Versus  Fusion  of  Visible  and  NIR  Spectra 
(Yellow) 


Shutter      Filter type (fusion of visible and near infrared spectra)           Fusion    Best      Percentage of
speed (ms)   VIS     902     904     906     908     910     912     914       best      filter    improvement
                                                                                result              over the visible
200          0.943   0.709   0.719   0.811   0.497   0.594   0.497   0.497     0.811     906       -16.276%
167          0.954   0.553   0.497   0.497   0.700   0.716   0.497   0.497     0.716     910       -33.230%
125          0.961   0.497   0.497   0.531   0.899   0.497   0.637   0.675     0.899     908       -6.906%
100          0.965   0.615   0.497   0.703   0.497   0.917   0.497   0.497     0.917     910       -5.171%
77           0.967   0.922   0.958   0.836   0.791   0.766   0.907   0.511     0.958     904       -0.926%
67           0.969   0.958   0.679   0.795   0.926   0.497   0.649   0.752     0.958     902       -1.138%
50           0.968   0.763   0.811   0.497   0.497   0.745   0.497   0.497     0.811     904       -19.311%
40           0.949   0.838   0.496   0.803   0.926   0.496   0.848   0.814     0.926     908       -2.462%
Average      0.960   0.736   0.641   0.681   0.717   0.653   0.628   0.596     0.875     902       -9.705%

the NIR images with varying transmission characteristics. For each of the 7 target
colours, 148 colour classifiers were generated and compared.

The details of the classification results can be found in Tables 1, 2, 3, 4, 5, 6 and
7. All the results, including the pure visible and the fusion of visible and
near-infrared signals, were calculated using Eqn. (4). The percentage of improvement
was calculated based on the best result from the fusion of signals (Fusion Best
Result column) and the best result from the pure visible spectrum (VIS column).

It can be seen from the tabulated results that the classification of the Pink, Violet
and Light Blue target colours improved over the pure visible approach. For the
remaining colours no improvement was observed. We hypothesise that the lack of
improvement is due to the insufficient number of generations and population size
used by the Genetic Algorithm. The chromosome size for the fused visible and
near-infrared approach is double the chromosome size used for the pure visible
approach. We can only deduce that the increase in chromosome size should be
accompanied by a significant increase in the number of generations and the
population size. We intend to test this hypothesis further in our future work.
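
As a small worked example of the percentage-of-improvement column, the sketch below computes the relative change of the best fusion score over the visible-only score. The function name and the choice of denominator are assumptions: the text does not state how the difference is normalised, and dividing by the fusion best result is simply the choice that matches the first row of the Yellow table (VIS 0.943, fusion best 0.811).

```python
def improvement_pct(fusion_best: float, vis: float) -> float:
    """Relative change of the best fusion score over the visible-only score, in percent.

    Assumption: the difference is normalised by the fusion score; the paper
    does not spell out the denominator.
    """
    return 100.0 * (fusion_best - vis) / fusion_best


# Yellow target colour, shutter speed 200: VIS = 0.943, fusion best = 0.811
print(round(improvement_pct(0.811, 0.943), 3))
```

A negative value means that the fused classifier scored below the visible-only classifier for that shutter speed.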

8  Conclusion 

This research takes a first step towards the fusion of visible and near-infrared
spectra for the purpose of colour classifying objects at spatially varying
illumination intensities. Empirical results show that the proposed integration
process and the accompanying Fuzzy-Genetic colour processing algorithms can recover
colours that are hardly distinguishable in the pure visible spectrum image alone.

Abstract. Combinatorial auctions, where bidders can bid on bundles of items,
have been the subject of increasing interest in recent years. Although much
research work has been conducted on combinatorial auctions, most has focused
on the winner determination problem. A largely unexplored area of research in
combinatorial auctions is the design of bidding strategies. In this paper, we
propose a new adaptive bidding strategy for the combinatorial auction-based resource
allocation problem in dynamic markets. A bidder adopting this strategy can
adjust his profit margin constantly according to his bidding history, thus
perceiving and responding to the dynamic market in a timely manner. Experiment
results show that agents using the adaptive bidding strategy perform very well, even
without any prior knowledge about the market.

Keywords:  combinatorial  auctions,  resource  allocation,  adaptive  strategies. 


1  Introduction 

The use of computing power provided by centralized and distributed infrastructures has
been of increasing research interest in computer science in recent years. The Internet
is an example of such infrastructures, where different users (people, software agents)
can use the provided computational resources to perform their own tasks [5]. The
resource allocation problem, that is, how to distribute resources among a group of
users, has received much attention and become an important issue. An Internet auction
is a natural choice for solving this kind of resource allocation problem, because it
allocates resources to the bidders who value them most and achieves an efficient
allocation of resources from the viewpoint of economics [2].

Combinatorial auctions, where bidders are allowed to place bids on bundles of items,
have received much attention from researchers in both computer science and economics [4].
Combinatorial auctions can lead to more economical allocations of resources than
conventional single-item auctions when bidders have complementarities (or
substitutabilities) among the items. Such an advantage can lead to an improvement in
efficiency, which has also been demonstrated in airport landing allocation and
transportation exchanges [9][11].

There has been a surge of research interest in combinatorial auctions in the last
decade. The two most widely studied problems are winner determination and auction
design. The winner determination problem is about finding the optimal allocation of
resources among a group of bidders. This optimization problem has been proved to be
NP-hard in the general case [10], and much work has been conducted on solving it,
including finding both optimal solutions and approximate solutions [12][6][16].
Combinatorial auction design involves the investigation of different auction
protocols for combinatorial auctions, such as single-round versus multi-round,
open-cry versus sealed-bid, and the use of various bidding rules [8][3].

A largely unexplored area of research in combinatorial auctions is the investigation
of bidding strategies. As combinatorial auctions are commonly combined with the
first-price sealed-bid auction protocol in many applications [3], we are especially
interested in bidding strategies in this kind of auction. In this paper, we consider a
scenario where first-price sealed-bid combinatorial auctions are employed to
distribute computational resources among a group of users, and propose a novel adaptive
bidding strategy. A bidder adopting this kind of strategy adjusts his profit margin
from time to time according to his bidding history, thus perceiving and responding to
the dynamic markets. Experiment results show that bidders with the adaptive strategy
obtain high utilities in different dynamic markets.

This paper is structured as follows. Section 2 presents related work. Section 3
presents the combinatorial auction model. Section 4 describes the adaptive bidding
strategy. Section 5 shows simulation results. Finally, Section 6 concludes this paper
and highlights some future work.


2  Related  Work 

The resource allocation problem is an important issue in the area of computer science.
In recent years, a lot of work has been conducted on solutions, among which centralized
mechanisms and distributed mechanisms are the two main approaches. In centralized
mechanisms, there is a resource manager that decides how to allocate resources
among a group of resource consumers, while in distributed mechanisms, consumers
coordinate implicitly or explicitly with one another to reach an agreement on the
allocation of resources.

Schwind et al. [14] attempt to solve the computational resource allocation problem
using multi-round combinatorial auctions. They study the situation where bidders
spend virtual currencies, which are obtained by selling unused resources, to get
access to the computational resources needed for accomplishing their own tasks. They
propose bidding strategies for two types of bidders: 1) impatient bidders, who benefit
from the instantaneous use of resources, and 2) quantity maximizing bidders, who
require high resource capacities but have weak preferences regarding the timing.
Experiment results show that for the first type of bidders, it is better to bid
aggressively to get fast access to resources, while the second type of bidders should
bid low prices and keep waiting for resources.

Sui and Leung [15] also employ multi-round combinatorial auctions to distribute
computational resources among a group of users. They propose an adaptive
bidding strategy for bidders in static markets where the ratios of supplies to demands
of resources are kept constant during the whole process of the auction. A bidder
adopting this kind of strategy can adjust his profit margin from time to time according
to his bidding history, and finally adapts to the current market environment even
without any prior knowledge about the market. Through simulations, they show that a
bidder using the adaptive strategy outperforms bidders using other strategies, and
receives high utilities when compared with optimal strategies in several static markets.

Galstyan et al. [7] study the resource allocation problem with a changing capacity.
In their work, each user uses a set of lookup tables to decide which resource to choose
and uses a simple reinforcement learning scheme to record the accuracy of these
tables. A lookup table guides the user's decision based on the neighbours' actions at
previous time steps. At the end of each time step, each user assesses the performance
of his lookup tables by increasing or reducing the score of each table by one point,
depending on whether it has correctly predicted a winning choice. Experiment results
show that users can adapt effectively to changing capacities in dynamic markets.

Schlegel and Kowalczyk [13] propose a self-organizing distributed resource
allocation algorithm. They study the case where multiple servers are providing identical
resources with a changing capacity to a group of resource consumers. For each
consumer, the decision on which server his task is executed is made independently
according to a predictor that is randomly chosen from a set of predictors. The probability
that a certain predictor is chosen is increased if a correct prediction is made, or is
decreased when a wrong prediction is made. Experiment results show that consumers
using the proposed approach can adapt to dynamic markets and their collaborative
behaviour achieves good resource load balancing.

3  Model  Description 

A combinatorial auction for the computational resource allocation problem is as
follows. There are $m$ different types of resources provided by a resource provider,
e.g., a server or a grid platform, to a group of $n$ users. For each type
$j \in \{1, 2, \ldots, m\}$ of resource, there is a capacity $c_j$ that denotes the
number of units currently available. The value of $c_j$ generally varies over time.
Such a market is called a dynamic market.

Each user needs certain resources to perform his task, and the maximum number of
units of type $j$ resources that a user can request is $m_j$. Each user
$i \in \{1, 2, \ldots, n\}$ submits a sealed bid $b = (R, p_i(R))$ to the resource
provider, where $R = \{r_1, r_2, \ldots, r_m\}$, $r_j \le m_j$, $1 \le j \le m$,
contains the number of units of each type of resource that he requests, and $p_i(R)$
is a positive number denoting the price he will pay for getting $R$. After receiving
bids from all users, the resource provider solves the winner determination problem,
that is, to find the allocation maximizing his revenue with the constraint that for
each type of resource $j$, the total number of units allocated does not exceed its
capacity $c_j$. Winning users will pay their bidding prices to get access to the
resources they bid for, perform their tasks, and then return the resources to the
resource provider. We refer to the process from the beginning of bid submission to the
end of resource return as a round of a combinatorial auction. Because such
computational resources are reusable, the combinatorial auction can be repeated for
multiple rounds before it is closed by the resource provider.
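
The winner determination step described above can be sketched as an exhaustive search over subsets of bids. This is only an illustration under stated assumptions: the function name `solve_wdp`, the tuple layout of a bid and the toy numbers are hypothetical, and the brute-force search is exponential in the number of bids, so it is not the solver used in the paper.

```python
from itertools import combinations


def solve_wdp(bids, capacity):
    """Exhaustively pick the revenue-maximising feasible subset of bids.

    bids:     list of (request, price); request[j] is the number of units of
              resource type j asked for in that bid.
    capacity: capacity[j] is the number of units c_j currently available.
    Returns (best_revenue, indices_of_winning_bids).
    """
    best_revenue, best_winners = 0.0, ()
    for k in range(1, len(bids) + 1):
        for subset in combinations(range(len(bids)), k):
            # Total units of each resource type consumed by this subset.
            used = [sum(bids[i][0][j] for i in subset) for j in range(len(capacity))]
            if all(u <= c for u, c in zip(used, capacity)):
                revenue = sum(bids[i][1] for i in subset)
                if revenue > best_revenue:
                    best_revenue, best_winners = revenue, subset
    return best_revenue, best_winners


# Hypothetical round: two resource types with capacities (2, 1) and three sealed bids.
bids = [((2, 1), 10.0), ((1, 1), 7.0), ((1, 0), 4.0)]
print(solve_wdp(bids, capacity=(2, 1)))
```

In this toy round the single most expensive bid does not win: the two smaller bids fit within the capacities together and yield a higher revenue for the provider.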

Before we describe our adaptive bidding strategy, we list some assumptions used in
this paper. First, we assume that the information available to each bidder is his own
bidding information in the previous rounds only, e.g., his previous bids and bidding
results, and any information about other bidders, such as other bidders' previous bids
and bidding results, is not accessible. Second, each bidder only submits one bid
per round, which is determined by the resource bundle he needs for his current task.
Hence a bidder who won in the previous round will submit a (generally) new bid,
while those who lost continue to submit the lost bids. However, a bid will be given up
after having been submitted for $\tau$ consecutive rounds and a new bid with a new
resource bundle will be submitted. This simulates the fact that a bidder has limited
patience for waiting.

As described in the above section, each winner needs to pay the price he has bid to get
the resources, and each loser pays nothing. For a bid $(R, p_i(R))$, each bidder $i$ has
his own valuation $v_i(R)$ of bundle $R$. A rational bidder will use a $p_i(R)$ that is
less than $v_i(R)$, otherwise he will get a negative utility when winning. That is,
$p_i(R) = (1 - pm_i) \times v_i(R)$, where $pm_i \in [0, 1]$ is known as bidder $i$'s
profit margin for the bid $(R, p_i(R))$. The utility of bidder $i$ is hence:

$$u_i(R) = \begin{cases} pm_i \times v_i(R) & \text{if } i \text{ wins} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
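
The relation between valuation, profit margin, bid price and utility can be checked with a few lines of code. This is a minimal sketch; the function names and the sample numbers are illustrative and do not come from the paper.

```python
def bid_price(valuation: float, pm: float) -> float:
    """Price offered for a bundle: the valuation reduced by the profit margin."""
    return (1.0 - pm) * valuation


def utility(valuation: float, pm: float, wins: bool) -> float:
    """Utility as in equation (1): pm * valuation for a winner, 0 for a loser."""
    return pm * valuation if wins else 0.0


# A bundle valued at 20 with a profit margin of 0.25.
print(bid_price(20.0, 0.25), utility(20.0, 0.25, wins=True), utility(20.0, 0.25, wins=False))
```

For a winner, the price paid and the utility always add up to the valuation, which is why a larger profit margin means a lower bid and a smaller chance of winning.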


Now, a bidder faces a dilemma. Bidding with a low profit margin generally increases
his chance of winning, but decreases his utility when he wins. The opposite
is also true. If a bidder is somehow able to get some prior knowledge about the
market environment, e.g., the number of bidders competing for resources, he might
be able to make use of such information to decide wisely. For example, a
bidder who knows that there are more supplies than demands should tend to use a
higher profit margin when bidding. However, as we assume in this paper, in an open
and dynamic environment, such information is usually inaccessible. Furthermore, the
market environment can vary from time to time. How to design an adaptive bidding
strategy that can help the bidder perceive the environment and respond to the market
in a timely manner, even with limited information, is a challenge.

4.1 Basic Concepts

Before we introduce the adaptive strategy, some basic concepts are defined.

Definition 1. A bidding record of a bid $b = (R_b, p_i(R_b))$ of bidder $i$ is a tuple
$br_b = (R_b, v_i(R_b), pm_i, w_b, res_b)$, where $R_b$ is the resource bundle required
in this bid, $v_i(R_b)$ is bidder $i$'s valuation for $R_b$, $pm_i$ is the profit margin
used by bidder $i$ in this bid, and $w_b \in [0, \tau]$ is the number of rounds that
bidder $i$ keeps on bidding with the bid before the bid is accepted ($res_b = 1$) or
given up ($res_b = 0$).

Definition 2. The anticipated utility $u_{br}(br_b)$ of a bidding record $br_b$ is

$$u_{br}(br_b) = pm_i \times \frac{res_b}{res_b + w_b} \qquad (2)$$

Hence, the minimum value of $u_{br}(br_b)$ is 0, where the bid is rejected and $w_b = \tau$;
while its maximum value is $pm_i$, if the bid is accepted in the first round.
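
Definition 2 can be illustrated with a short sketch. The `BiddingRecord` class below is a hypothetical container that keeps only the fields the formula needs (the bundle and its valuation are omitted), and the explicit guard for abandoned bids is an added assumption that avoids a division by zero if a record with `res = 0` and `w = 0` were ever passed in.

```python
from dataclasses import dataclass


@dataclass
class BiddingRecord:
    pm: float   # profit margin used in the bid
    w: int      # rounds spent bidding before the bid was accepted or given up
    res: int    # 1 if the bid was accepted, 0 if it was given up


def anticipated_utility(br: BiddingRecord) -> float:
    """Anticipated utility of one bidding record, following equation (2)."""
    if br.res == 0:
        return 0.0  # a bid that was given up yields no utility
    return br.pm * br.res / (br.res + br.w)


print(anticipated_utility(BiddingRecord(pm=0.4, w=0, res=1)))  # accepted at once
print(anticipated_utility(BiddingRecord(pm=0.4, w=1, res=1)))  # accepted after one lost round
print(anticipated_utility(BiddingRecord(pm=0.4, w=3, res=0)))  # given up
```

Each lost round before acceptance shrinks the anticipated utility, so a bid that wins late is rated lower than the same bid winning immediately.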

Definition 3. The bidding history of a bidder is the sequence of the most recent $\lambda$
bidding records.

Therefore, every time a bid is accepted or given up, the oldest bidding record is
removed from the bidding history and the new bidding record is appended to the
bidding history. We will use $bh$ to denote the current bidding history. We define the
age of a bidding record as follows.

Definition 4. The age of a bidding record $br_b$ in $bh$ is the number of times $bh$ is
updated after $br_b$ is appended to $bh$.

For any bidding record in $bh$, its age is always between 0 and $\lambda - 1$. No bidding
history contains any bidding record of age older than $\lambda$.

In a dynamic market, information contained in newer bidding records is more
valuable than that in older bidding records. We define the weight of each bidding
record as follows:

Definition 5. A weight function on the age of bidding records is a decreasing
function $f_w : \{0, 1, \ldots, \lambda\} \to [0, 1]$ that maps the ages of bidding records
to the importance of the information contained in the bidding records.

The newer a bidding record is, the better it can reflect the current market
environment, and the higher the weight it is given.¹

Definition 6. The anticipated utility $u_{bh}(bh)$ of the bidding history $bh$ is the
weighted average of the anticipated utilities of the bidding records in $bh$:

$$u_{bh}(bh) = \frac{\sum_{br_b \in bh} f_w(br_b) \times u_{br}(br_b)}{\sum_{br_b \in bh} f_w(br_b)} \qquad (3)$$

Here $f_w(br_b)$ is the weight given to the age of $br_b$. Based on Definition 6, we
introduce two notations, $u_{bh}(bh|_{\ge \sigma})$ and $u_{bh}(bh|_{\le \sigma})$, which
denote the weighted average anticipated utilities of the bidding records in $bh$ whose
profit margins are not less than and not more than $\sigma$, respectively:

$$u_{bh}(bh|_{\ge \sigma}) = \frac{\sum_{br_b \in bh,\; pm_i \ge \sigma} f_w(br_b) \times u_{br}(br_b)}{\sum_{br_b \in bh,\; pm_i \ge \sigma} f_w(br_b)} \qquad (4)$$

$$u_{bh}(bh|_{\le \sigma}) = \frac{\sum_{br_b \in bh,\; pm_i \le \sigma} f_w(br_b) \times u_{br}(br_b)}{\sum_{br_b \in bh,\; pm_i \le \sigma} f_w(br_b)} \qquad (5)$$

¹ A sample weight function is $f_w(a) = 1 - a/\lambda$.

4.2  An  Adaptive  Strategy 

The basic idea of the adaptive strategy is that a bidder should continuously review and
revise the profit margin in use. This process is called an adaptation of the profit
margin, through which a bidder aims to dynamically maintain a good profit
margin in response to the ever-changing market environment.


Algorithm 1. Adaptive Strategy

 1: $pm = \eta$, step $= \theta$, $\delta = 1$ and $u' = 0$.
 2: while auction does not finish do
 3:   Use $pm$ to bid for the subsequent rounds
 4:   if a new bidding record $br_b$ is formed then
 5:     Update $bh$ and compute $u_{br}(br_b)$.
 6:     $u = u_{br}(br_b)$ and $pm' = pm$.
 7:     if CheckStepDecrease() = true then
 8:       Decrease step by $\gamma$;
 9:     else if CheckStepIncrease() = true then
10:       Increase step by $\gamma$;
11:     end if
12:     if $u = 0$ AND $u' = 0$ then
13:       $pm = pm -$ step
14:     else if $u \ne 0$ AND $u' \ne 0$ then
15:       if $u < u'$ then $pm = pm - \delta \times$ step else if $u > u'$ then $pm = pm + \delta \times$ step end if
16:     else if $u = 0$ OR $u' = 0$ then
17:       Compute $u_{bh}(bh|_{\ge pm})$ and $u_{bh}(bh|_{\le pm})$
18:       if $u_{bh}(bh|_{\ge pm}) < u_{bh}(bh|_{\le pm})$ then $pm = pm -$ step else $pm = pm +$ step end if
19:     end if
20:     if $pm > pm'$ then $\delta = 1$ else if $pm < pm'$ then $\delta = -1$ end if
21:     $u' = u$
22:   end if
23: end while


We use a binary variable $\delta$ to indicate the direction of an adjustment of the profit
margin: if $\delta = 1$, then the adjustment is positive, otherwise negative. In addition,
we use $u$ and $u'$ to denote the anticipated utilities of the most recent and the second
most recent bidding records. Finally, while $pm$ denotes the current profit margin, $pm'$
denotes the profit margin before the previous adjustment.

The adaptive strategy is given in Algorithm 1. Functions CheckStepDecrease
(line 7) and CheckStepIncrease (line 9) check whether step needs to be decreased or
increased. We note that if a small value is used for step, the adaptive strategy will
need a long time to approach the new optimal profit margin when the market
environment changes, but the profit margin generated by the adaptive strategy can be more
refined. On the other hand, if a large value is used, the profit margin can be adjusted
more quickly when the market environment changes, but the bidder might over-adapt to
changes. Therefore, we need to adjust the value of step dynamically during the
process of the adaptation.

The adaptive strategy works as follows. At first, $pm$, step, $\delta$ and $u'$ are
initialized. During the process of the auction, the bidder reviews the value of $pm$
whenever a new bidding record is formed. To decide how to change $pm$, the bidder
first updates the bidding history and computes the anticipated utility $u$ of the latest
bidding record (lines 5 to 6). If $u$ and $u'$ are both 0, the current profit margin is
thought to be too high, and is then decreased by step (lines 12 to 13). If neither $u$ nor
$u'$ is 0, but the previous adjustment of the profit margin (recorded by $\delta$ on line 20)
has led to a decrease of the anticipated utility, an adjustment in the opposite direction
will be made; otherwise, an adjustment in the current adjustment direction will be made
(lines 14-15). Finally, if only one of $u$ and $u'$ is 0, it is not clear how the profit
margin should be adjusted, because an anticipated utility of 0 may have many causes,
e.g., a low valuation of the bundle, rather than a profit margin that is too high. In this
case we rely on the bidding history: if $u_{bh}(bh|_{\ge pm}) < u_{bh}(bh|_{\le pm})$, which
means that decreasing the profit margin will obtain a higher average anticipated utility,
the bidder will make a negative move, otherwise a positive move (lines 16-18).
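
One adaptation step of the profit margin (the branch structure of lines 12 to 20 of Algorithm 1) can be sketched as a pure function. The function name, the argument list and the clamping of the result to [0, 1] are assumptions added for this illustration; the two history utilities are passed in as plain numbers rather than computed from a bidding history.

```python
def update_profit_margin(pm, step, delta, u, u_prev, u_hist_ge, u_hist_le):
    """Return (new_pm, new_delta) after one adaptation of the profit margin.

    u, u_prev:  anticipated utilities of the latest and the previous record.
    u_hist_ge:  weighted history utility of records with margin >= pm.
    u_hist_le:  weighted history utility of records with margin <= pm.
    """
    old_pm = pm
    if u == 0 and u_prev == 0:
        pm -= step                      # margin looks too high: lower it
    elif u != 0 and u_prev != 0:
        if u < u_prev:
            pm -= delta * step          # last move hurt: reverse direction
        elif u > u_prev:
            pm += delta * step          # last move helped: keep direction
    else:                               # exactly one of the two utilities is 0
        if u_hist_ge < u_hist_le:
            pm -= step
        else:
            pm += step
    pm = min(max(pm, 0.0), 1.0)         # assumption: keep the margin inside [0, 1]
    if pm > old_pm:
        delta = 1
    elif pm < old_pm:
        delta = -1
    return pm, delta


print(update_profit_margin(0.5, 0.1, 1, 0.0, 0.0, 0.0, 0.0))
print(update_profit_margin(0.5, 0.1, -1, 0.1, 0.2, 0.0, 0.0))
```

Keeping the update free of side effects makes each branch easy to check in isolation before wiring it into a full auction loop.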

Next, we describe the CheckStepDecrease and CheckStepIncrease functions in
detail.


Algorithm 2. Function: CheckStepDecrease
1: Compute $mean = \frac{1}{\kappa} \sum_{l=1}^{\kappa} pmh^l$.
2: for $l = 1$ to $\kappa$ do if $|pmh^l - mean| < sh^l$ then $\omega_l = 1$ else $\omega_l = 0$ end if end for
3: return $\sum_{l=1}^{\kappa} \omega_l > \varphi$ AND $\omega_\kappa = 1$ AND $pm \Rightarrow mean$ AND step $> \alpha$


4.2.1 Function I: CheckStepDecrease

As mentioned above, the value of step should be adjusted from time to time so that
the profit margin revised by the adaptive strategy is more suitable to
the current market situation. The function CheckStepDecrease determines when to
decrease the value of step. We first define a number of notations.

Definition 7. The profit margin history $pmh$ is a sequence of the $\kappa$ profit margins
used in the most recent $\kappa$ bidding records.

Definition 8. The step history $sh$ is a sequence of the $\kappa$ values used as step in the
most recent $\kappa$ profit margin updates.

We use the notation $pm \Rightarrow \pi$ to denote that 1) $pm < \pi$ and the next adjustment
for $pm$ is positive; or 2) $pm > \pi$ and the next adjustment for $pm$ is negative.

The function CheckStepDecrease is given in Algorithm 2. We first compute the
mean value of the elements in $pmh$ (line 1), then for each element in $pmh$ we check
whether it is close enough to the mean (line 2). To decrease step, we need several
conditions to be satisfied (line 3). First, if a significant number of elements in $pmh$
are fluctuating around the mean, then we regard the mean as an approximation of the
optimal profit margin in the current market environment. Second, if the last element in
$pmh$ is close enough to the mean, and $pm \Rightarrow mean$, then the optimal profit margin
can be approached with higher accuracy if step is decreased. Finally, step cannot be too
small (smaller than $\alpha$). If all these conditions hold, the function returns true.
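
The test performed by CheckStepDecrease can be sketched as below. The argument names are hypothetical, the histories are assumed to be ordered from oldest to newest, and the direction of the next adjustment is passed in as `delta` (+1 or -1), which is one possible reading of the notation $pm \Rightarrow mean$.

```python
def points_toward(pm, target, delta):
    """pm => target: the next move (delta = +1 or -1) heads toward target."""
    return (pm < target and delta == 1) or (pm > target and delta == -1)


def check_step_decrease(pmh, sh, pm, delta, step, phi, alpha):
    """Return True when step should be decreased.

    pmh: recent profit margins, oldest first; sh: the step used at each update.
    phi: minimum count of margins that must lie close to the mean.
    alpha: step is never decreased once it is at or below this value.
    """
    mean = sum(pmh) / len(pmh)
    close = [1 if abs(p - mean) < s else 0 for p, s in zip(pmh, sh)]
    return (sum(close) > phi
            and close[-1] == 1
            and points_toward(pm, mean, delta)
            and step > alpha)


pmh = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.52, 0.48, 0.51, 0.49]
sh = [0.05] * 10
print(check_step_decrease(pmh, sh, pm=0.49, delta=1, step=0.05, phi=6, alpha=0.01))
```

The margins in this example oscillate tightly around 0.50, which is the situation in which a smaller step lets the bidder settle closer to the apparent optimum.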


Algorithm 3. Function: CheckStepIncrease

1: negMove = 0, posMove = 0.
2: for $i = 1$ to $\kappa - 1$ do
3:   if $pmh^{i+1} < pmh^{i}$ then negMove++ else if $pmh^{i+1} > pmh^{i}$ then posMove++ end if
4: end for
5: return (negMove $> \chi$ AND $\delta = -1$ AND step $< \beta$)
          OR (posMove $> \chi$ AND $\delta = 1$ AND step $< \beta$)


4.2.2 Function II: CheckStepIncrease

The function CheckStepIncrease given in Algorithm 3 determines when to increase
step. We count the numbers of positive and negative moves made in the profit margin
history, respectively (lines 2-4). Note that it never happens that $pmh^{i} = pmh^{i+1}$
because by Algorithm 1, a positive or negative move is always made when a new bidding
record is formed. If there are many negative moves in the profit margin history (more
than $\chi$) and the next move of the profit margin is negative (first part of line 5), we
believe that the market environment is becoming more competitive for resource
consumers and the bidder should increase the value of step to adapt to the new market
quickly. Similarly, step should also be increased when there are many positive moves
in the profit margin history (more than $\chi$) and the next move of the profit margin is
positive (second part of line 5). In either case, if the threshold value $\beta$ to stop
increasing step is not reached, the function will return true.
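
The counting logic of CheckStepIncrease can be sketched in a few lines. The argument names are hypothetical and the profit margin history is assumed to be ordered from oldest to newest, so a negative move is a margin that is lower than its predecessor.

```python
def check_step_increase(pmh, delta, step, chi, beta):
    """Return True when step should be increased.

    pmh:   recent profit margins, oldest first.
    delta: direction of the next move (+1 or -1).
    chi:   number of same-direction moves that must be exceeded.
    beta:  step is never increased once it has reached this value.
    """
    pairs = list(zip(pmh, pmh[1:]))
    neg_moves = sum(1 for prev, cur in pairs if cur < prev)
    pos_moves = sum(1 for prev, cur in pairs if cur > prev)
    return ((neg_moves > chi and delta == -1 and step < beta)
            or (pos_moves > chi and delta == 1 and step < beta))


falling = [0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45]
print(check_step_increase(falling, delta=-1, step=0.05, chi=7, beta=0.1))
```

A long run of moves in one direction suggests the market has shifted, so a larger step lets the margin catch up faster.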

5  Experiment  Evaluation 

To evaluate the performance of the adaptive strategy, we conduct two sets of
experiments. In the first set of experiments, we try to identify the optimal profit
margins in different market environments. In the second set of experiments, we show
that the adaptive strategy outperforms the random strategy and that its performance is
very close to that of an oracle strategy that makes use of market information that is
assumed to be inaccessible. In addition, we also illustrate the typical adaptation
processes of the profit margin in dynamic markets in the second set of experiments.

5.1  First  Set  of  Experiments:  Estimation  of  the  Optimal  Profit  Margin 

5.1.1  Experiment  Setup 

In the first set of experiments, our aim is to find the best profit margins in
markets with different supply/demand ratios. This is done by first testing a set of fixed
strategies that use fixed profit margins throughout the whole process of the auction.
We use 19 different fixed strategies with profit margins $pm_1 = 0.05$, $pm_2 = 0.10$,
$pm_3 = 0.15$, ..., $pm_{19} = 0.95$.

We find the best fixed strategy for a particular type of market as follows. For each
fixed strategy, we repeat the combinatorial auction for 100 runs, with each run
consisting of 500 rounds of combinatorial auctions. Following the previous work
[1][14], in each run, we have one testing bidder using that fixed strategy while others
are bidding with their true valuations. After 100 runs, the accumulated utilities
obtained by the testing bidders using different fixed strategies are compared, and the
best performing fixed strategy is identified.

The experiment settings are as follows. A group of $n = 60$ users compete for $m = 4$
types of resources provided by a resource provider. For each bidder, the numbers of
units that he can request for different resources are integers randomly drawn from
uniform distributions [0, 3], [0, 2], [0, 2] and [0, 1]. His valuations for a single unit
of different resources are real numbers randomly drawn from uniform distributions
[3, 6], [4, 8], [4, 8] and [6, 10]. For a resource bundle $R$ which contains more than one
type of resource, a synergy seed, $syn(R)$, is randomly drawn from a uniform
distribution [-0.2, 0.2], and his valuation for that bundle is the product of the sum of
the valuations of the individual resources and $1 + syn(R)$. A positive synergy seed means
complementarities among resources and a negative synergy seed means substitutability
among them.
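
The way a bundle and its valuation are drawn can be sketched as follows. This is a minimal illustration under assumptions: the function and constant names are hypothetical, the sum of valuations is taken as units times per-unit value for each resource type, and the last valuation range is read as [6, 10].

```python
import random

UNIT_RANGES = [(0, 3), (0, 2), (0, 2), (0, 1)]                     # units per resource type
VALUE_RANGES = [(3.0, 6.0), (4.0, 8.0), (4.0, 8.0), (6.0, 10.0)]   # value of a single unit


def draw_bundle_and_valuation(rng):
    """Draw a random bundle and the bidder's valuation for it."""
    units = [rng.randint(lo, hi) for lo, hi in UNIT_RANGES]
    unit_values = [rng.uniform(lo, hi) for lo, hi in VALUE_RANGES]
    base = sum(n * v for n, v in zip(units, unit_values))
    # The synergy seed only applies when more than one resource type is requested.
    types_requested = sum(1 for n in units if n > 0)
    synergy = rng.uniform(-0.2, 0.2) if types_requested > 1 else 0.0
    return units, base * (1.0 + synergy)


rng = random.Random(2010)
for _ in range(3):
    print(draw_bundle_and_valuation(rng))
```

Passing an explicit `random.Random` instance keeps the draws reproducible across runs, which helps when comparing strategies on the same sequence of bundles.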

Table  1  summarises  the  parameters  used  in  experiments. 

5.1.2  Experiment  Results  and  Analysis 

The simulation results of the first set of experiments are shown in Fig. 1. Each curve
represents a certain market environment, and the accumulated utilities of the testing
bidders using 19 different fixed strategies are compared. We can see that the less
competitive the market is for resource consumers, the higher the value of the best fixed
profit margin will be. For example, in the market with a supply/demand ratio of 1.2:1,
the best fixed profit margin is 0.95, while in the market with a supply/demand ratio of
0.5:1 the best fixed profit margin is 0.15. This agrees with our expectation that in a
market less competitive for bidders, it is better to use a high profit margin, and vice
versa.

Based on the results in Fig. 1, we use a regression method to interpolate these
optimal values to approximate the optimal profit margins in different market
environments. The result is shown in Fig. 2. The red dots are the best performing fixed
profit margins obtained from Fig. 1, and are regarded as sample points. We use a piecewise
function $opt(rf)$ to fit these samples:


$$opt(rf) = \begin{cases} \ldots & rf < \rho \\ \ldots & rf \ge \rho \end{cases} \qquad (6)$$

Table 1. Parameters used in the experiments

Parameter   Value Used   Description
$\tau$      3            Maximum lost round
$\lambda$   5            Length of a bidding history
$\eta$      0.05         Initial value of pm
$\theta$    0.1          Initial value of step
$\gamma$    2            Amount of decrease or increase for step
$\kappa$    10           Length of a profit margin history
$\varphi$   6            See Algorithm 2
$\alpha$    0.01         Threshold to stop decreasing step
$\chi$      7            See Algorithm 3
$\beta$     0.1          Threshold to stop increasing step

The result of the regression is that $a = 0.0001334$, $b = 2561.574$, $c = 0.1645$,
$d = 0.05$ and $\rho = 1.106062$, which is shown as the blue line in Fig. 2. We can see
that the blue line fits our samples very well, and when talking about a certain type of
market denoted by $rf$, we will use the function value as the optimal profit margin.²

We can imagine a bidder who somehow has access to equation (6) and the current rf,
and who can therefore always make a very good decision on what profit margin to use.
Such a bidder is in effect using a bidding strategy that is practically impossible. We
shall refer to such a bidding strategy as an oracle strategy, which will be used as a
benchmarking strategy in the second set of experiments to evaluate the performance of
the adaptive strategy.
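
As a rough illustration of how an oracle bidder might evaluate the regression curve,
the sketch below fills the two branches of the piecewise function with an assumed form:
an exponential a * b^rf + c to the left of p and the constant 1 - d to the right. This
form is an assumption and is not taken from the text; it is used here only because,
with the reported parameter values, it reaches roughly 0.95 at rf = p, which matches
the stated maximum profit margin. The function name opt_profit_margin is hypothetical.

```python
# Sketch only: the branch expressions below are ASSUMED, not quoted from the paper.
# Reported regression parameters:
A, B, C, D, P = 0.0001334, 2561.574, 0.1645, 0.05, 1.106062


def opt_profit_margin(rf: float) -> float:
    """Approximate optimal profit margin for a market with capacity factor rf."""
    if rf < P:
        return A * B ** rf + C   # assumed exponential left branch
    return 1.0 - D               # assumed constant right branch (0.95)


if __name__ == "__main__":
    # An oracle bidder would look up the margin for the current rf in each round.
    for rf in (0.5, 0.8, 1.0, 1.2):
        print(f"rf={rf:.1f} -> margin={opt_profit_margin(rf):.3f}")
```

Under these assumptions the curve rises slowly for small rf and steeply as rf
approaches p, after which the margin stays at the cap.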

5.2  Second  Set  of  Experiments:  Performance  of  the  Adaptive  Strategy 
5.2.1  Experiment  Setup 

In this section, we compare the performance of the random strategy, the oracle
strategy and the adaptive strategy. The random strategy uses a random profit margin for
each bidding record. The oracle strategy assumes that the bidder is privileged and has
complete knowledge of the current market environment, which is denoted by the value of
rf, and always uses the best profit margin for the latest market given by equation (6)
when bidding. Note that the maximum profit margin that equation (6) can generate is
0.95. Therefore, we also set an upper bound of 0.95 on the profit margins generated by
both of the other strategies.²


Fig. 1. Utilities of testing bidders using 19 different fixed strategies in markets with
different supply/demand ratios

Fig. 2. Regression curve of the optimal profit margin


¹ Here, an exponential function is used as the left part of the regression curve, and
it does not matter much if other functions are used. This is because in the second set
of experiments, we never use equation (6) to estimate the optimal profit margin of a
market whose rf falls outside [0.5, 1.2], and the estimated optimal profit margin will
not vary much if other fitting functions are used.

² Actually, setting this upper bound does not affect the performance of the adaptive
strategy. This is because without this constraint, when the optimal profit margin is a
value infinitely close to 1, the profit margin generated by the adaptive strategy is
also very close to 1, and the bidder using the adaptive strategy does not lose utility
at all.

A dynamic market is one in which the value of rf changes over time. We consider
three types of dynamic markets, which are shown in the left column of Fig. 3. In the
first, the capacity factor rf alternates between changing linearly and staying constant;
in the second, rf changes in a linear pattern; and in the last, rf changes in a cosine
pattern. A run is now composed of 900 rounds. Other settings are the same as those in
Section 5.1.1.
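
To make the three market types concrete, the following sketch generates one possible
rf schedule for each over a 900-round run. Only the run length and the qualitative
shapes come from the description above; the value range [0.5, 1.2], the segment length
and the period are assumptions chosen for illustration, and the function names are
hypothetical.

```python
import math

ROUNDS = 900          # run length stated in the text
LO, HI = 0.5, 1.2     # ASSUMED range of the capacity factor rf


def ramp_and_hold(t: int, seg: int = 150) -> float:
    """Market I (assumed shape): ramp linearly for `seg` rounds, then hold for `seg`."""
    block, pos = divmod(t, seg)
    rising = (block // 2) % 2 == 0               # direction flips after each ramp+hold pair
    frac = pos / seg if block % 2 == 0 else 1.0  # odd blocks hold the ramp's end value
    return LO + (HI - LO) * frac if rising else HI - (HI - LO) * frac


def triangle(t: int, period: int = 300) -> float:
    """Market II (assumed shape): linear rise followed by linear fall."""
    pos = (t % period) / period
    frac = 2 * pos if pos < 0.5 else 2 * (1 - pos)
    return LO + (HI - LO) * frac


def cosine(t: int, period: int = 300) -> float:
    """Market III (assumed shape): cosine oscillation between LO and HI."""
    return (LO + HI) / 2 + (HI - LO) / 2 * math.cos(2 * math.pi * t / period)


if __name__ == "__main__":
    for name, schedule in (("I", ramp_and_hold), ("II", triangle), ("III", cosine)):
        series = [schedule(t) for t in range(ROUNDS)]
        print(name, round(min(series), 3), round(max(series), 3))
```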

5.2.2  Experiment  Results  and  Analysis 

The middle column in Fig. 3 shows the simulation results of the second set of
experiments. The utilities obtained by the bidders using the adaptive strategy (AS), the
oracle strategy (IS) and the random strategy (RS) are compared. We can see that the
bidders using the adaptive strategy perform fairly well: they outperform the random
strategy and obtain good utilities compared to the oracle strategy in all dynamic
markets.

We also show in the right column of Fig. 3 some typical adaptation processes of
the profit margin in a single run in different dynamic markets. The red lines indicate
the optimal profit margins given by equation (6) and the blue lines show the profit
margins used by the adaptive strategy. We can see that the profit margin used by the
adaptive strategy is close to the optimal profit margin given by equation (6). This
means that the bidder using the adaptive strategy is capable of adapting to different
dynamic markets. In addition, the adaptation is timely, even when the optimal profit
margin changes sharply, e.g., the change of the profit margin shown in Fig. 3(c).


[Fig. 3 panels: (g) Dynamic Market III; (h) Utilities; (i) An adaptation process]


Fig. 3. Performance and Adaptation Process of the Adaptive Strategy in Different Dynamic
Markets

6 Conclusions and Future Work

In this paper, we propose a new adaptive bidding strategy for the combinatorial
auction-based resource allocation problem in dynamic markets. A bidder adopting this
strategy can adjust his profit margin from time to time according to his bidding
history, and thus perceive and respond to the changing market environment. Through
simulations, we show that 1) the adaptive strategy performs fairly well compared to the
random strategy and the oracle strategy in different dynamic markets; 2) the bidder
using the adaptive strategy can obtain high utilities, even without any prior knowledge
about the market; and 3) the bidder using the adaptive strategy is capable of adapting
in dynamic markets and responds in a timely manner.

There are several points for our future work. First, we assume in this paper that the
computational resources to be auctioned are reusable. There are also many applications
where the auctioned resources are non-reusable. As a next step, we are going to study
the adaptive behaviour in such auctions. Second, in this paper, we only consider the
situation where the supplies of resources vary gradually over time, and the
effectiveness of the adaptive strategy in markets with abrupt changes is not yet
known. In the future, we intend to study markets where both the number of bidders
and the capacities of resources can change. Finally, from the simulation results we can
see that although the adaptive strategy performs well, there is still room for
improvement. We are going to explore the influences of different parameters on the
performance of the adaptive strategy.

Abstract. "The only thing constant is change." - Ray Kroc (Founder of
McDonald's). Self-organizing neuro-fuzzy machines are maturing in their
online learning process for time-invariant conditions. However, to maximize
the operative value of these self-organizing approaches for online-reasoning,
such self-sustaining mechanisms must embed capabilities that aid the
reorganizing of knowledge structures in real-time dynamic environments.
Also, neuro-fuzzy machines are well-regarded as approximate reasoning tools
because of their strong tolerance to imprecision and handling of uncertainty.
Recently, Tan and Quek (2010) discussed an online self-reorganizing
neuro-fuzzy approach called SeroFAM for financial time-series forecasting.
The approach is based on the BCM theory of neurological learning via
metaplasticity principles (Bienenstock et al., 1982), which addresses the
stability limitations imposed by the monotonic behavior in Hebbian theory
for online learning (Rochester et al., 1956). In this paper, we examine an
adapted version called iSeroFAM for interval-forecasting of financial
time-series that follows a computationally efficient approach adapted from
Lalla et al. (2008) and Carlsson and Fuller (2001). An experimental
proof-of-concept is presented for interval-forecasting of 80 years of the
Dow Jones Industrial Average Index, and the preliminary findings are
encouraging.

Keywords: neuro-fuzzy, fuzzy associative learning, online-learning,
online-reasoning, self-organizing, self-reorganizing, evolving, time-variant,
time-varying, BCM, Bienenstock Cooper Munro, sliding threshold,
synaptic plasticity, meta-plasticity, dissociative, anti-Hebbian,
interval-forecasting.


1  Introduction 

In soft-computing sciences, online hybrids of neuro-fuzzy computing such as
SAFIS [19], Simpl_eTS [3], eTS [2], DENFIS [13], EFuNN [12], SOFNN [15],
and DFNN [26] are gaining popularity as cost-effective tools that can exploit
tolerance for imprecision [27]. The neuro-fuzzy departure from the precision arts
provides natural leeway for uncertainty and vagueness [23], which are typically
unavoidable in many real-world forecasting problems. Extensive reviews of their
properties are covered in [17,16,11].

Online neuro-fuzzy learning has been studied from two perspectives: 1) time-invariant,
or 2) time-variant, depending upon the characterization of the underlying system
dynamics and the duration of the temporal-space involved. [22] discusses how, for
time-invariant problems with little or no temporal variation, neuro-fuzzy machines
can effectively self-organize to learn a final structure, where all data should be
experienced with equal emphasis. If applied under time-variant conditions, such a
self-organizer would actually average out the effects of the time-variance to obtain
a middle-ground solution that would be a structural mix of obsolete and new
information. Over time, the structure carries increasingly redundant information,
which impairs the currency of results.

To manage more complex time-variant datasets that exhibit regime shifting
properties, neuro-fuzzy machines need to continuously self-reorganize their
internal structures to attune towards these pattern shifts in evolving data streams.
This strong distinction in learning objectives is vital for modeling decision
environments with changing characteristics, and has been widely discussed in
advanced signal processing [8,9]. Generally, this necessitates increasing bias on
recent data to identify persistent patterns, relative to transient ones. To account
for information decay, forgetting factors [10], exponential gain functions [24] or
adaptive training schemas [21] can be used to increase emphasis on more recent
data experiences ([11], pp. 222). As such, Tan and Quek [22] described a tailored
focus towards online-reasoning rather than online-learning, as self-reorganizing
machines would generally serve transient reasoning purposes. Using
self-reorganizing approaches to handle time-variance can be especially relevant in
financial time-series forecasting, as shown in [22] and later in Section 3.

The second consideration for review is to enable interval-forecasting through
a self-reorganizing approach. In much forecasting research, the idea is to train
a model that can determine an accurate prediction for comparison, to derive
the lowest mean squared error and highest correlation scores. From a technical
point of view, this approach makes sense. Accurate point-based forecasting is
important, and is a valuable indicator for explicitly highlighting best fit trends.
However, in truth, it is difficult to transfer the ownership of decision risk onto
these computational forecasting models. There is no certainty in forecasting.
Computational forecasting is about managing uncertainties and improving the
decision making process, but ultimately such models are only decision support tools.
For this reason, we are concerned with the prediction of an interval-forecast that can
align neuro-fuzzy reasoning to the concept of volatility risk and uncertainty.

This paper proposes iSeroFAM, a modified interpretation of SeroFAM [22],
which is computationally based on the BCM theory of meta-plasticity for online
self-reorganizing fuzzy-associative learning. Here, the objective is to realize
interval-forecasting capabilities as conceptualized by Carlsson and Fuller [6].
Section 2 provides a high-level overview of the learning approach. The paper
focuses on Section 3, which examines the experimental proof-of-concept for
self-reorganizing interval-forecasting using iSeroFAM.

2 iSeroFAM: Self-reorganizing Fuzzy Associative
Machine with Interval-Forecasting

The proposed iSeroFAM is an extension of SeroFAM [22], as a means to exploit
its self-reorganizing capabilities for interval-forecasting. It is an online neural
connectionist construct with five neuronal layers (see topology in Fig. 1). The
actual input and output signals at any time $t$ are given as crisp vectors
$X(t) = [x_1, \ldots, x_i, \ldots, x_n]^T$ and $Y(t) = [y_1, \ldots, y_j, \ldots, y_m]^T$,
and the symbols used in Fig. 1 denote: $n$ as no. of input features; $m$ as no. of
output features; $P_i$ as no. of membership functions (MFs) for $x_i$; $Q_j$ as no. of
membership functions (MFs) for $y_j$; $L$ as no. of fuzzy premise nodes; an input
sensor for each $x_i$, where $1 \le i \le n$; $IL_{p_i}$ as the $p_i$-th MF for $x_i$,
where $1 \le p_i \le P_i$; $A_l$ as the $l$-th fuzzy premise node, where $1 \le l \le L$;
$R_{l,q}$ as the fuzzy rule link between $A_l$ and $OL_q$; $OL_{q_j}$ as the $q_j$-th MF
for $y_j$, where $1 \le q_j \le Q_j$; and an output actuator for each $y_j$, where
$1 \le j \le m$.

Like SeroFAM, iSeroFAM operates by interleaving reasoning (testing) and learning
(training) events, and the BCM learning process is described in detail in [22]. In
brief, a rate-based Hebbian modification $\dot{m}$ for rule learning is given as eqn. (1):

$\dot{m} = x \cdot \phi_{Hebb} = x \cdot y$    (1)

where $x$ and $y$ are the pre-synaptic and post-synaptic signals of a neuron. However,
it is not biologically plausible [7], since it implies that the synaptic weights
would reach arbitrarily large values over time. By acting independently at each
synapse, Hebbian plasticity gains great power, but its monotonic behavior causes
this stability problem [18], which makes it less useful for online computational
learning.

Fig. 1. Neuronal connections and layers for reasoning and learning events in iSeroFAM

Fig. 2. Computational synaptic plasticity using the sliding threshold, $\theta_m$

The excitatory drive to a neuron has to be tightly regulated in an associative
and dissociative manner (Hebbian and anti-Hebbian) to prevent saturation,
otherwise information will be lost and no selectivity will develop [4]. Following
this, [4] demonstrated that floating a sliding threshold $\theta_m$ as a function of the
averaged activity of the cell could overcome problems of runaway excitation
and explain neural learning mechanisms. iSeroFAM applies a discrete computational
form of the non-linear BCM activation function $\phi_{BCM} = y(y - \theta_m(t))$, as
compared to the Hebbian activation function $\phi_{Hebb} = y$, as depicted in Fig. 2.
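
The contrast between the two activation functions can be seen in a small single-synapse
simulation. The sketch below is illustrative only and is not the iSeroFAM update: it
assumes a constant input, a linear neuron y = m * x, and the common textbook choice of
letting the sliding threshold track a running average of y squared. The learning rate
lr, the threshold rate tau and the function names are hypothetical.

```python
def phi_hebb(y: float) -> float:
    """Hebbian activation: always reinforces in the direction of y."""
    return y


def phi_bcm(y: float, theta: float) -> float:
    """BCM activation: potentiates when y > theta, depresses when y < theta."""
    return y * (y - theta)


def simulate(rule: str, steps: int = 5000, lr: float = 0.01, tau: float = 0.1) -> float:
    """Return the final weight of one synapse driven by a constant input x = 1."""
    m, theta, x = 0.5, 0.0, 1.0
    for _ in range(steps):
        y = m * x                              # post-synaptic signal of a linear neuron
        if rule == "bcm":
            theta += tau * (y * y - theta)     # ASSUMED: threshold follows average of y^2
            m += lr * x * phi_bcm(y, theta)
        else:
            m += lr * x * phi_hebb(y)          # grows without bound
    return m


if __name__ == "__main__":
    print("Hebbian final weight:", simulate("hebb"))
    print("BCM final weight:    ", simulate("bcm"))
```

In this toy setting the Hebbian weight grows exponentially, while the BCM weight
settles where the activity equals the threshold.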

With reference to Fig. 1, the discrete update to the potential of the rule link
$R_{l,q}$ between the premise node $A_l$ and the output fuzzy node $OL_q$ at time $t$
can be written as eqn. (2):

$\Delta P_{l,q}(t) = \underbrace{\phi\big(g_{l,q}(t), \theta_{l,q}(t)\big) \cdot f_{l,q}(t)}_{\text{homosynaptic LTP (+ve) or LTD (-ve)}} - \underbrace{c \cdot P_{l,q}(t-1)}_{\text{heterosynaptic LTD}}$    (2)

where: $P_{l,q}$ is the potential of $R_{l,q}$; $f_{l,q}$ is the pre-synaptic signal
produced by $A_l$ with a Gaussian function; $g_{l,q}$ is the post-synaptic signal
produced by $y$ in $OL_q$ with a Gaussian function; $\theta_{l,q}$ is the sliding
threshold based on $g_{l,q}$; $c$ is the uniform decay given by $c = (1 - \lambda)$;
and $\lambda$ is the forgetting factor [22]. The first half of eqn. (2) forms the
basis for homosynaptic long-term potentiation (LTP) [5] and long-term depression
(LTD) [20] depending on $\mathrm{sign}(\phi)$, while the second half explains
exponential decay via heterosynaptic LTD [1].

2.1 Computation of Interval-Forecasts

The interval-forecast is conceptually based on the possibilistic variability
measure described by Carlsson and Fuller [6]. To illustrate, first consider the output
having three fuzzy membership functions with centroids $c_1^{(j)}$, $c_2^{(j)}$ and
$c_3^{(j)}$ as shown in Fig. 3. Assume input vectors A and B create two unique
reasoning output spaces shown in Figs. 3(a) and 3(b) respectively. Note that both
reasoning spaces will defuzzify to the same value, even though their possibilistic
spread about the central value differs.

Ceteris paribus, the reasoning for input vector A reflects a higher degree
of confidence relative to input vector B. To quantify this variance, Lalla et
al. (2008) [14] examined a centre-of-gravity (COG) variance that was a
computationally friendlier measurement than the mathematically derived possibilistic
variance by Carlsson and Fuller [6]. Here, a mean-of-maxima (MoM) variance
is implemented in Layer 5 of iSeroFAM shown in Fig. 1. The 5th activation relay
$f^{V}$ of iSeroFAM is modified to a MoM defuzzification as shown in eqn. (3),
where $c_{q_j}^{(j)}$ is the centroid of the output fuzzy node $OL_{q_j}$:

$o^{V(j)} = f^{V(j)}(\cdot) = \dfrac{\sum_{q_j=1}^{Q_j} \ldots}{\sum_{q_j=1}^{Q_j} \ldots}$    (3)


In addition to the single crisp forecast represented by $o^{V(j)}$, the MoM variance,
$\omega^{V(j)}$, is defined as a by-product computation at the defuzzification layer 5
based on eqn. (4).

Fig. 3. Reasoning output spaces from iSeroFAM

Based on $o^{V(j)}$ and $\omega^{V(j)}$, an interval-forecast tuple with a lower-bound
and upper-bound decision range can be computed as shown in eqn. (5), where $k$ is
the interval-multiplier that controls the range of the interval-forecast:

$D^{V(j)} = [\, o^{V(j)} - k \cdot \omega^{V(j)},\; o^{V(j)} + k \cdot \omega^{V(j)} \,]$    (5)
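
The interval construction of eqn. (5) can be illustrated with a short sketch. Two
output spaces over the same three centroids are built so that they defuzzify to the
same value but differ in spread, mirroring the input vectors A and B discussed above.
The variability measure used here, an activation-weighted standard deviation about the
defuzzified value, is an assumed stand-in and is not the MoM variance of eqn. (4); the
centroid and activation values and all function names are hypothetical.

```python
import math


def defuzzify(centroids, activations):
    """Activation-weighted average of the output centroids (crisp forecast)."""
    total = sum(activations)
    return sum(c * a for c, a in zip(centroids, activations)) / total


def spread(centroids, activations):
    """ASSUMED variability measure: activation-weighted standard deviation."""
    centre = defuzzify(centroids, activations)
    total = sum(activations)
    var = sum(a * (c - centre) ** 2 for c, a in zip(centroids, activations)) / total
    return math.sqrt(var)


def interval_forecast(o, omega, k=3.0):
    """Eqn. (5): lower and upper bound around the crisp forecast o."""
    return (o - k * omega, o + k * omega)


if __name__ == "__main__":
    centroids = [1.0, 2.0, 3.0]
    act_a = [0.1, 0.8, 0.1]   # concentrated output space (higher confidence)
    act_b = [0.4, 0.2, 0.4]   # dispersed output space (lower confidence)
    for name, act in (("A", act_a), ("B", act_b)):
        o, w = defuzzify(centroids, act), spread(centroids, act)
        print(name, round(o, 3), round(w, 3), interval_forecast(o, w))
```

Both inputs give the same crisp forecast, but the wider spread for B yields a wider
interval for the same multiplier k.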


3  Proof-of-Concept 

The proof-of-concept experiments on iSeroFAM use real-world financial time-series
data based on the Dow Jones Industrial Average (DJIA) index. About eighty years of
daily index values were collected from the Yahoo! Finance website under the ticker
symbol "^DJIA" for the period 2nd Jan 1930 to 31st Dec 2009, which provided 20,097
data-points for the experiment. Fig. 4 shows the movement of the index values with a
time-variant exposure to the numerical range [41.22, 14164.53]. The main discussion
will focus on the trajectory shifts of the index values that are especially rough
after the 1980s, which are more noticeable from the increasing volatility in daily
differences shown in the bottom half of Fig. 4. For the following experiments, the
parameters as explained in [22] are: G = 60 days, p = 0.8, b = 5.0, and z_max = 40.


[Figure: horizontal axis Observation Time (T), 1930 to 2009]

Fig. 4. Dow Jones Industrial Average daily index: 02 Jan 1930 to 31 Dec 2009 (80 years)


The analysis proceeds with an online precision-forecast of the DJIA index values
using input vector $X(T) = [m(T-4), m(T-3), m(T-2), m(T-1), m(T)]$ and output
vector $Y(T) = [m(T+1)]$, where $m$ is the absolute value of the DJIA index.
Table 1 presents the experimental results. In this step-wise forecasting task,
iSeroFAM performs with an overall NDEI = 0.0282 using an average of 19.4 rules,
with a Pearson correlation of R2 = 0.999. Based on availability constraints, this
experiment was benchmarked against DENFIS and EFuNN for reference. From the
tabulations, it is evident that iSeroFAM outperforms the Mamdani-based EFuNN [12]
both in terms of accuracy and the number of rules generated. On the other hand, the
Takagi-Sugeno-based (T-S) fuzzy-precision DENFIS model has a comparative advantage
over iSeroFAM in terms of accuracy and number of rules used. However, the results
have to be interpreted with care. Although DENFIS and EFuNN are dynamic learning
systems, they are not exactly online-reasoning. DENFIS normalizes data before
learning, which indicates assumptions of prior knowledge of the upper and lower bound
of the dataset, whereas only the rule nodes layer evolves in EFuNN [25]. On the other
hand, online self-reorganizing models such as iSeroFAM are challenged without
prior knowledge of the complete set of datapoints at any point in time.

Table 1. Forecasting 80 years of DJIA market index

Model        Type      Ref.   Num. rules   NDEI     R2
iSeroFAM *   Mamdani   -      19.4         0.0282   0.9996
EFuNN        Mamdani   [12]   91.6         0.1426   0.9917
DENFIS       T-S       [13]   5.0          0.0157   0.9999

* iSeroFAM is fully online self-reorganizing.

Fig. 5. Dow Jones Industrial Average $m(T+1)$ forecasting results
(horizontal axis: Observation Time (T), 1980 to 2009)
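
For readers unfamiliar with the NDEI metric used in the comparison, the sketch below
computes it under its usual definition, the root mean squared error divided by the
standard deviation of the target series. The definition is not restated in the text
above, so the use of the population standard deviation is an assumption, and the
function name ndei is hypothetical.

```python
import math
import statistics


def ndei(actual, predicted):
    """Non-dimensional error index: RMSE divided by the std. dev. of the targets."""
    if len(actual) != len(predicted):
        raise ValueError("series must have equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    # ASSUMED: population standard deviation of the actual series.
    return math.sqrt(mse) / statistics.pstdev(actual)


if __name__ == "__main__":
    actual = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(ndei(actual, actual))              # a perfect forecast
    print(ndei(actual, [3.0] * 5))           # always predicting the mean
```

A forecaster that always predicts the mean of the series scores about 1 under this
definition, so values far below 1 indicate that the forecast captures most of the
variation in the series.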

Next, the analysis fast-forwards to the period of changes in the last 30 years,
from 1980 to 2009. Fig. 5 provides the forecast plots of outputs $m(T+1)$ against
the desired actuals. The forecasts from iSeroFAM noticeably follow through the
trajectory shifts in the DJIA index, including the two peaks and two valleys
occurring in years 2000, 2007, 2003 and 2009 respectively. During these 30 years of
online learning, iSeroFAM performed at least 14 major reorganizations in the
rule-base, as can be observed from Fig. 6(a). During times of change, rules are
quickly unlearnt and new ones learnt to improve the currency of the knowledge
representation. In real terms, iSeroFAM effectively re-learnt its rule associations
on movements in the DJIA about once every two years, which is rather consistent with
the experimental observations in [22]. Also, while the rule count appears to
fluctuate a lot in Fig. 5, the visualization for the year 2009 in Fig. 6(b) shows
that the online rule-learning process is rather gradual.

(b) Self-reorganization of rule neurons for Year 2009.

Fig. 6. Self-reorganization of associations in rule neurons during learning process

(a) Self-reorganizing of membership cluster widths

Fig. 7. Self-reorganization of $m(T+1)$ cluster neurons over 30 years

(a) Interval-forecasting (without moving average) for Year 2009.

(b) Interval-forecasting (with moving average) for Year 2009.

Fig. 8. Interval-forecasting with/without moving average smoothing

The BCM rule learning process is supported by a self-reorganizing cluster
algorithm described in [22]. Reorganization of the cluster nodes occurs in three
aspects: (a) cluster widths, (b) cluster centroids and (c) number of cluster nodes.
Fig. 7 provides a visual summary of how the clusters shift and spread into new
data regions over three decades. In addition, the bottom half of Fig. 7 shows
that about twenty output $m(T+1)$ clusters are used on average.

Fig. 9. Interval-multiplier effect on bounded classification rate

Fig. 10. Correlation between normalized moving average interval and daily differences

Next, the interval-forecasting based on eqn. (5) for $k = 3$ is examined for the
DJIA index. From iSeroFAM, the interval-forecasts for Year 2009 are generated
with the lower bounds, upper bounds and spot forecasts as shown in Fig. 8(a).
At any point in time, when the "desired" output falls within the lower and
upper bounds of the forecast, the output is considered to be classified correctly.
When the "desired" output falls outside the bounds of the interval-forecast, the
output is considered to be classified incorrectly. The bottom half of Fig. 8(a)
indicates the bounded classification as '1' when the output is correctly classified,
and '0' when the output is incorrectly classified. In this case, the bounded
classification is about 84% for the eighty years of DJIA interval-forecasting.

To smoothen the interval-forecasting from Fig. 8(a), a ten-day moving average
of the interval-forecasts was computed, as shown in Fig. 8(b). As can be
seen, the interval-forecasting with moving average improves the readability of the
lower and upper bounds of the interval-forecasts. With the moving average, the
bounded classification rate rises to about 87% of outputs falling within the
bounds.
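
The bounded classification rate and the moving-average smoothing described above can
be sketched as follows. This is an illustration on made-up numbers, not the
experimental procedure: the trailing window and the handling of the first few points
(averaging over whatever history is available) are assumptions, and the function names
are hypothetical.

```python
def bounded_rate(actuals, lowers, uppers):
    """Fraction of desired outputs that fall within [lower, upper]."""
    hits = [lo <= a <= hi for a, lo, hi in zip(actuals, lowers, uppers)]
    return sum(hits) / len(hits)


def moving_average(values, window=10):
    """Trailing moving average; early points average over the history available."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]    # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


if __name__ == "__main__":
    actuals = [10.0, 10.5, 9.0, 11.0, 10.2]
    lowers = [9.5, 9.8, 9.6, 10.0, 9.9]
    uppers = [10.5, 10.9, 10.4, 10.8, 10.6]
    print("raw rate:     ", bounded_rate(actuals, lowers, uppers))
    smooth_lo = moving_average(lowers, window=3)
    smooth_hi = moving_average(uppers, window=3)
    print("smoothed rate:", bounded_rate(actuals, smooth_lo, smooth_hi))
```

Smoothing the lower and upper bounds separately keeps the interval readable while
leaving the spot forecast untouched.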

Logically, the bounded classification rate is affected by the interval-multiplier
specified in eqn. (5). The larger the multiplier, the higher the bounded
classification rate. Fig. 9 examines the impact of the interval-multiplier with
respect to the bounded classification for the interval-forecasts, with and without
moving average. As can be seen, the moving average interval-forecasts work better
only at higher multiplier values. On the other hand, as mentioned earlier for
Fig. 8(b), the moving average approach provides improved visualization of the
interval-forecasting. In addition, it has been noted that there is a positive
correlation of 0.71 between the actual daily differences, $d(T+1) = m(T+1) - m(T)$,
and the moving average interval-forecasts on a normalized basis. This is
interesting for further detailed study because it suggests, on a preliminary basis,
that the moving average interval-forecasts could provide some forecast of the real
volatility risk of the DJIA index.

4  Summary  and  Conclusion 

This paper presents iSeroFAM, an online self-reorganizing neuro-fuzzy approach
that is based on the BCM theory of metaplasticity. BCM theory accounts for
temporal shifts in learning online patterns through a self-correcting associative
and dissociative learning mechanism. The experimental proof-of-concept for the
iSeroFAM approach was based on the real-world DJIA Index. Preliminary findings
show that iSeroFAM reorganizes its rules and clusters about once in two
years to meet changing environmental conditions. Also, moving average based
interval-forecasting appears to be a useful variability indicator of real volatility
in the DJIA market index.

Abstract. Traditional designs of neural fuzzy systems are largely user-dependent,
whereby the knowledge to form the computational structures of the systems is
provided by the user. Designing a neural fuzzy system based on experts'
knowledge results in a non-varying structure of the system. To overcome the
drawback of a heavily user-dependent system, self-organizing methods that are
able to directly utilize knowledge from the numerical training data have been
incorporated into the neural fuzzy systems to design the systems. Nevertheless,
this data-driven approach is insufficient in meeting the challenges of real-life
application problems with time-varying dynamics. Hence, this paper is a novel
attempt in addressing the issues involved in the design of an evolving
Type-2 Mamdani-type neural fuzzy system by proposing the evolving
Type-2 neural fuzzy inference system (eT2FIS), an online system that
is able to fulfill the requirements of evolving structures and updating
parameters to model the non-stationarities in real-life applications.

Keywords: Evolving systems, online systems, neural fuzzy systems,
incremental sequential learning, Type-2 fuzzy systems.


1  Introduction 

There are two main issues to consider in the design of a neural fuzzy system:
(1) the fuzzy partitionings of the input-output dimensions and (2) the generation
of the fuzzy rulebase of the system. Traditionally, the design of a neural
fuzzy system is largely user-dependent, whereby both the fuzzy partitioning
and the rulebase of the system are manually crafted by human experts. The
structure of the neural fuzzy system is fixed once the necessary knowledge has
been determined by the experts, and only the parameters of the system are
updated in subsequent training. In order to minimize the dependency on subjective
information from human users, numerical methods such as fuzzy Kohonen
partitioning [2], fuzzy C-means [1] and linear vector quantization [9] were
incorporated into the systems to directly acquire knowledge from the numerical
training data to perform fuzzy partitioning. In addition, self-organizing rule
generation schemes [13] [15] [12] were also proposed to overcome the knowledge
acquisition bottleneck. This subsequently led to a new class of neural fuzzy systems
with self-organizing abilities that are able to directly utilize knowledge from the
numerical training data to design the computational structures of the systems.

Nevertheless, the demands and complexities of real-life applications often
require the neural fuzzy system to be able to adapt not just its parameters, but
also its structure in order to model the changing dynamics of the application
environments. Subsequently, this led to an intense research effort in the study
of evolving/online neural fuzzy systems, which are able to adapt both the
structures and the parameters of the systems to model such time-varying dynamics.
Depending on the formulation of the set of fuzzy rules that governs the
computational structure of the network, there are generally two classes of evolving
neural fuzzy systems, mainly the Takagi-Sugeno-Kang (TSK) systems [3] [8] [4] [5] and
the Mamdani systems [10] [7] [14]. Most of the existing work in the literature
consists of evolving Type-1 TSK-type and Type-1 Mamdani-type neural fuzzy
systems. These models may not perform adequately under noisy application
environments when compared to their Type-2 counterparts due to the use of crisp
membership grades. Hence, there have been recent efforts to extend the evolving
Type-1 TSK-type neural fuzzy systems to Type-2 systems, as seen by the emergence
of the SEIT2FNN [4] and the ORGQACO [5] models. In contrast, there
has been no such attempt in the parallel track for evolving Type-1 Mamdani-type
neural fuzzy systems.

This paper is a novel attempt at synergizing the individual frameworks of
evolving systems and Type-2 Mamdani-type neural fuzzy systems by presenting
the evolving Type-2 neural fuzzy inference system (eT2FIS). The proposed
eT2FIS model adopts a two-phase incremental sequential learning scheme whereby
the neural fuzzy system performs structural learning and parameter learning
upon the arrival of each new training data point. Initially, there are no
fuzzy partitionings or fuzzy rules in the system, i.e., there are no hidden
nodes in the network. Subsequently, the computational structure of the neural
fuzzy system, which is governed by a set of Type-2 IF-THEN Mamdani rules, is
incrementally formulated based on the knowledge from each training data point.
There are three key operations contained in the structural learning phase of
the system: (1) the generation of new fuzzy rules, (2) the deletion of
obsolete rules, and (3) the merger of highly similar fuzzy labels; while
parameter learning is performed using the neural-network based backpropagation
mechanism.

The rest of the paper is organized as follows. The structure and the operations
of eT2FIS are described in Section 2. Section 3 presents the online learning
mechanism of eT2FIS. The adaptation abilities of the system are evaluated in
Section 4. Section 5 concludes the paper.

2  eT2FIS: Architecture and Neural Operations

The eT2FIS is a five-layer neural fuzzy system as shown in Fig. 1. Layer 1
of the system consists of the input linguistic nodes; layer 2 consists of the
antecedent nodes; layer 3 is the rule nodes; layer 4 is the consequent nodes;
and layer 5 consists of the output linguistic nodes. As mentioned before, the
initial form of the neural fuzzy system consists of no hidden layers and
learning for the system is performed incrementally when each training tuple
$[X(t); D(t)]$ is presented to the system one at a time, where
$X(t) = (x_1(t), \ldots, x_i(t), \ldots, x_I(t))$ and
$D(t) = (d_1(t), \ldots, d_m(t), \ldots, d_M(t))$ represent the vectors for the
input training data and the corresponding desired output data at time step $t$
respectively. Each input node $IV_i$, $i \in \{1 \ldots I\}$, in layer 1 of the
system takes in a single input value $x_i$ from the input training vector and
subsequently, each output node $OV_m$, $m \in \{1 \ldots M\}$, in layer 5
produces a single output value $y_m$, where the corresponding computed output
vector is represented as $Y(t) = (y_1(t), \ldots, y_m(t), \ldots, y_M(t))$.
During time step $t$, each input node $IV_i$ will consist of $J_i(t)$
corresponding fuzzy labels in layer 2 of the system such that each antecedent
node is represented as $IL_{i,j_i}$, $j_i \in \{1 \ldots J_i(t)\}$. Similarly,
each output node $OV_m$ will consist of $L_m(t)$ corresponding fuzzy labels in
layer 4 of the system such that each consequent node is represented as
$OL_{l_m,m}$, $l_m \in \{1 \ldots L_m(t)\}$. The connectionist structure of the
proposed model is based on a set of fuzzy rules $R_k$,
$k \in \{1 \ldots K(t)\}$, defined in layer 3 of the system. In the proposed
model, the number of rules $K(t)$, the number of fuzzy labels for the $i$-th
input variable $J_i(t)$ and the number of fuzzy labels for the $m$-th output
variable $L_m(t)$ vary with the changes in the underlying dynamics of the
application environment.

Fig. 1. Architecture of the evolving Type-2 neural fuzzy inference system (eT2FIS)

For the proposed eT2FIS model, the training parameters are the centres of
the left and right formation gaussian functions of the interval Type-2 fuzzy
labels present in layers 2 and 4 of the network as shown in Fig. 2. Each fuzzy
label in the antecedent layer and consequent layer is defined by its footprint
of uncertainty [11]
$\mu_{\tilde{A}}(x) = [\underline{\mu}_{\tilde{A}}(x), \overline{\mu}_{\tilde{A}}(x)]$,
where $\tilde{A}$ denotes the Type-2 fuzzy set.

Fig. 2. An interval Type-2 fuzzy set in the antecedent/consequent layer denoted by its
left and right formation gaussian functions

Subsequently, the lower and upper membership functions of $\tilde{A}$ are
defined as

$$\underline{\mu}_{\tilde{A}}(x) = \begin{cases} \mu_R(c_R, \sigma; x) & \text{if } x \le \frac{c_L + c_R}{2} \\ \mu_L(c_L, \sigma; x) & \text{otherwise} \end{cases}$$

and

$$\overline{\mu}_{\tilde{A}}(x) = \begin{cases} \mu_L(c_L, \sigma; x) & \text{if } x < c_L \\ 1 & \text{if } c_L \le x \le c_R \\ \mu_R(c_R, \sigma; x) & \text{if } x > c_R \end{cases}$$

respectively. Here, $\mu_L(c_L, \sigma; x)$ and $\mu_R(c_R, \sigma; x)$ refer to
the left and right formation gaussian functions respectively, such that they
are defined based on the underlying gaussian function $\mu(c, \sigma; x)$,
where $c$ is the centre of the function and $\sigma$ is the width of the
function.
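
As an illustration, the lower and upper membership functions defined above can be sketched in Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the exact form of the underlying gaussian is an assumption (a standard exp(-(x - c)^2 / (2 sigma^2)) is used here).

```python
import math

def gaussian(c, sigma, x):
    # Assumed gaussian form; the exact exponent used in the paper is not certain.
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def lower_membership(c_left, c_right, sigma, x):
    # Right gaussian up to the midpoint of the two centres, left gaussian after it.
    if x <= (c_left + c_right) / 2.0:
        return gaussian(c_right, sigma, x)
    return gaussian(c_left, sigma, x)

def upper_membership(c_left, c_right, sigma, x):
    # Left gaussian below c_left, unity between the centres, right gaussian above c_right.
    if x < c_left:
        return gaussian(c_left, sigma, x)
    if x <= c_right:
        return 1.0
    return gaussian(c_right, sigma, x)

def membership_interval(c_left, c_right, sigma, x):
    # Interval membership grade [lower, upper] of x in an interval Type-2 label.
    return (lower_membership(c_left, c_right, sigma, x),
            upper_membership(c_left, c_right, sigma, x))
```

For any input, the lower grade never exceeds the upper grade, and the two coincide only when the left and right centres are equal, in which case the label reduces to a Type-1 gaussian set.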

Next, the two key neural operations in the proposed model, namely the forward
and the backward computations, are described as follows.

2.1  Forward  Operations 

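The forward aggregated input and output for an arbitrary node are denoted as
NET and Z respectively.

Layer 1: $NET_{IV_i} = Z_{IV_i} = x_i$.

Layer 2: $NET_{IL_{i,j_i}} = x_i$ and
$Z_{IL_{i,j_i}} = [\underline{f}_{i,j_i}, \overline{f}_{i,j_i}]$ such that the
similarity between the input value $x_i$ and the respective fuzzy label is an
interval Type-1 set with bounds defined by

$$\underline{f}_{i,j_i} = \begin{cases} \mu_{R_{i,j_i}}(c_{R_{i,j_i}}, \sigma; x_i) & \text{if } x_i \le \frac{c_{L_{i,j_i}} + c_{R_{i,j_i}}}{2} \\ \mu_{L_{i,j_i}}(c_{L_{i,j_i}}, \sigma; x_i) & \text{otherwise} \end{cases}$$

and

$$\overline{f}_{i,j_i} = \begin{cases} \mu_{L_{i,j_i}}(c_{L_{i,j_i}}, \sigma; x_i) & \text{if } x_i < c_{L_{i,j_i}} \\ 1 & \text{if } c_{L_{i,j_i}} \le x_i \le c_{R_{i,j_i}} \\ \mu_{R_{i,j_i}}(c_{R_{i,j_i}}, \sigma; x_i) & \text{if } x_i > c_{R_{i,j_i}} \end{cases}$$

respectively.

Layer 3: $NET_{R_k} = \{[\underline{f}^{(k)}_{i,j_i}, \overline{f}^{(k)}_{i,j_i}]\}$
and $Z_{R_k} = [\underline{f}_k, \overline{f}_k]$, where the overall similarity
between the input vector and the antecedent segment of the $k$-th fuzzy rule
is an interval Type-1 set with bounds given as
$\underline{f}_k = \min_{i \in \{1 \ldots I\}} \underline{f}^{(k)}_{i,j_i}$ and
$\overline{f}_k = \min_{i \in \{1 \ldots I\}} \overline{f}^{(k)}_{i,j_i}$
respectively.

Layer 4: $NET_{OL_{l_m,m}} = \{[\underline{f}_k, \overline{f}_k]\}$ and
$Z_{OL_{l_m,m}} = [\underline{f}_{l_m,m}, \overline{f}_{l_m,m}]$, where
$\underline{f}_{l_m,m} = \max_{k \in K_{l_m,m}} \underline{f}_k$ and
$\overline{f}_{l_m,m} = \max_{k \in K_{l_m,m}} \overline{f}_k$ respectively.
Here, $K_{l_m,m}$ is the set of fuzzy rules in the system that share the same
output label $OL_{l_m,m}$.

Layer 5: $NET_{OV_m} = Y_m$ and $Z_{OV_m} = y_m$, where the type-reduced set
obtained using the height type-reduction (HTR) [6] is an interval Type-1 set

$$Y_m := \int_{p_1} \cdots \int_{p_{L_m}} 1 \Big/ \frac{\sum_{l_m=1}^{L_m} y_{l_m,m} \, p_{l_m}}{\sum_{l_m=1}^{L_m} p_{l_m}} = [y_m^{\min}, y_m^{\max}].$$

Here, $y_{l_m,m}$ is defined to be the midpoint of the domain of $OL_{l_m,m}$
and $p_{l_m} \in [\underline{f}_{l_m,m}, \overline{f}_{l_m,m}]$. Then the
computed output is given to be the defuzzified value
$y_m = \frac{1}{2}[y_m^{\min} + y_m^{\max}]$.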
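
The forward computations of layers 3 to 5 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: all function names are hypothetical, intervals are represented as (lower, upper) tuples, and the bounds of the type-reduced set are found by brute-force enumeration of the interval endpoints, which is an assumption made for brevity (it is valid because the weighted average is monotone in each weight, but it scales exponentially with the number of output labels).

```python
from itertools import product

def rule_firing(label_intervals):
    # Layer 3: min over the antecedent label intervals of one rule.
    return (min(lo for lo, _ in label_intervals),
            min(hi for _, hi in label_intervals))

def consequent_activation(rule_intervals):
    # Layer 4: max over the firing intervals of all rules sharing an output label.
    return (max(lo for lo, _ in rule_intervals),
            max(hi for _, hi in rule_intervals))

def height_type_reduce(midpoints, intervals):
    # Layer 5: bounds of sum(y * p) / sum(p) with each p inside its interval.
    values = []
    for weights in product(*intervals):
        total = sum(weights)
        if total > 0.0:  # skip the degenerate all-zero corner
            values.append(sum(y * p for y, p in zip(midpoints, weights)) / total)
    if not values:
        raise ValueError("no output label is activated")
    return min(values), max(values)

def defuzzify(bounds):
    # Crisp output: midpoint of the type-reduced interval.
    return 0.5 * (bounds[0] + bounds[1])
```

For example, with two output labels whose midpoints are 0.0 and 1.0 and whose activation intervals are (0.2, 0.8) and (0.4, 0.6), the type-reduced interval is [1/3, 0.75].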


2.2  Backward  Operations 

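The backward operation of the eT2FIS model, as represented by the dotted
arrows in Fig. 1, from layer 5 to layer 3 of the system is a mirrored computation
of the forward operation. Correspondingly, the backward aggregated input and
output for an arbitrary node are denoted as $NET^{back}$ and $Z^{back}$
respectively.

Layer 5: $NET^{back}_{OV_m} = Z^{back}_{OV_m} = d_m$.

Layer 4: $NET^{back}_{OL_{l_m,m}} = d_m$ and
$Z^{back}_{OL_{l_m,m}} = [\underline{f}^{back}_{l_m,m}, \overline{f}^{back}_{l_m,m}]$
such that the bounds are defined as

$$\underline{f}^{back}_{l_m,m} = \begin{cases} \mu_{R_{l_m,m}}(c_{R_{l_m,m}}, \sigma; d_m) & \text{if } d_m \le \frac{c_{L_{l_m,m}} + c_{R_{l_m,m}}}{2} \\ \mu_{L_{l_m,m}}(c_{L_{l_m,m}}, \sigma; d_m) & \text{otherwise} \end{cases}$$

and

$$\overline{f}^{back}_{l_m,m} = \begin{cases} \mu_{L_{l_m,m}}(c_{L_{l_m,m}}, \sigma; d_m) & \text{if } d_m < c_{L_{l_m,m}} \\ 1 & \text{if } c_{L_{l_m,m}} \le d_m \le c_{R_{l_m,m}} \\ \mu_{R_{l_m,m}}(c_{R_{l_m,m}}, \sigma; d_m) & \text{if } d_m > c_{R_{l_m,m}} \end{cases}$$

respectively.

Layer 3: $NET^{back}_{R_k} = \{[(\underline{f}^{back}_{l_m,m})^{(k)}, (\overline{f}^{back}_{l_m,m})^{(k)}]\}$
and $Z^{back}_{R_k} = [\underline{f}^{back}_k, \overline{f}^{back}_k]$, where
$\underline{f}^{back}_k = \min_{m \in \{1 \ldots M\}} (\underline{f}^{back}_{l_m,m})^{(k)}$ and
$\overline{f}^{back}_k = \min_{m \in \{1 \ldots M\}} (\overline{f}^{back}_{l_m,m})^{(k)}$
respectively.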

The backward neural computation of the eT2FIS is defined to (1) calculate the
certainty factors of the fuzzy rules, and (2) determine the creation of a new
fuzzy rule (refer to Section 3) when each training tuple $[X(t); D(t)]$ is
presented to the system. The certainty factor of a fuzzy rule in the system,
as defined in (1), reflects the potential of the rule in describing the
current underlying dynamics of the application environment.

$$Cer_k(t) := \max[Age_k(t), Act_k(t)]; \quad Cer_k(0) := 1 \qquad (1)$$

where $Age_k(t) := \eta_k \cdot Cer_k(t-1)$ constitutes the forgetting
component and
$Act_k(t) := \min\{\frac{1}{2}[\underline{f}_k + \overline{f}_k], \frac{1}{2}[\underline{f}^{back}_k + \overline{f}^{back}_k]\}$
constitutes the enhancement component to the certainty factor. Initially, the
certainty factor for a newly formed fuzzy rule is set as unity. This means
that the newly created rule is assigned the highest degree of faith in its
ability to model the application environment since $0 \le Cer_k \le 1$. As
time progresses, the determination of the certainty factor of a rule is
dominated either by the forgetting component $Age_k$ or by the enhancement
component $Act_k$. If a rule in the system is able to generalize the recently
encountered set of training data well, the dominating factor in the
calculation of its certainty factor is the enhancement component. This
subsequently enables the computed certainty factor to be of a high value, thus
ensuring that the rule will remain in the fuzzy rulebase of the model. On the
other hand, if a rule fails to give a satisfactory representation of the
current set of encountered training data, then the forgetting mechanism kicks
in. Subsequently, the faith in the rule decreases gradually over time until it
becomes invalid to the application or it gets recovered through a rehearsal
episode. Hence, through this incremental update of the certainty factors for
the fuzzy rules in the proposed model, the system is ensured a current and
up-to-date rulebase that is able to model the underlying dynamics of the
application environment.
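
The certainty factor update in (1) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the forward and backward rule activations are passed in as (lower, upper) intervals, and the forgetting factor values used in the usage note are illustrative only, since no value is stated in the text.

```python
def interval_mean(interval):
    # Midpoint of a (lower, upper) activation interval.
    lower, upper = interval
    return 0.5 * (lower + upper)

def update_certainty(previous, forgetting_factor, forward, backward):
    # Forgetting component: decayed copy of the previous certainty factor.
    age = forgetting_factor * previous
    # Enhancement component: weaker of the forward and backward matches.
    act = min(interval_mean(forward), interval_mean(backward))
    return max(age, act)
```

A rule that matches the current tuple well keeps a high certainty, e.g. `update_certainty(0.5, 0.9, (0.8, 1.0), (0.7, 0.9))` is dominated by the enhancement component (about 0.8), whereas a poorly matching rule simply decays, e.g. `update_certainty(1.0, 0.99, (0.1, 0.3), (0.2, 0.4))` gives 0.99.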


3  Incremental  Learning  in  eT2FIS 


The proposed eT2FIS model adopts a two-phase incremental learning process,
namely the structural learning and the parameter learning, as shown in Fig. 3.
Three key operations are contained within the structural learning phase of the
system: (1) the generation of new fuzzy rules, (2) the deletion of obsolete
rules, and (3) the merger of highly over-lapping/similar fuzzy labels; while
parameter learning is performed using the neural-network based backpropagation
mechanism. The initial neural fuzzy system is empty, i.e., there are no hidden
layers, and learning is performed incrementally where each training tuple is
presented to the system individually at time step $t$. When the first training
sample $[X(0); D(0)]$ arrives, the knowledge from the training data point is
used to initialize the system by forming the fuzzy labels $IL_{i,1}$ and
$OL_{1,m}$ such that the centres of the left and right functions of the new
fuzzy labels are set to be the corresponding input and output values from the
vectors $X(0)$ and $D(0)$. The width $\sigma$ is fixed in the functions. In
addition to establishing the fuzzy partitionings in the input-output
dimensions of the system, a new fuzzy rule is also created to encode the
knowledge represented by the training data point, where the antecedent and
consequent segments of the new fuzzy rule are defined by the respective sets
of newly created fuzzy labels $\{IL_{i,1}\}_{i \in \{1 \ldots I\}}$ and
$\{OL_{1,m}\}_{m \in \{1 \ldots M\}}$. On the other hand, the online structural
and parameter learning processes of the system are activated by an incoming
training tuple if there are existing rules in the system, and the system
evolves and learns based on the information provided by the new training
tuple. This section describes the online learning mechanism of the eT2FIS.


3.1  Structural  Learning 

The three main operations in the structural learning phase of the system are
described as follows.

Fig. 3. Flowchart of the incremental learning process in eT2FIS

Creation of new rule: The structural learning phase of the proposed model
is activated with the arrival of a new training data sample $[X(t); D(t)]$ to a
non-empty neural fuzzy system. If there exists a fuzzy rule $R_{k^*}$ in the
current rulebase of the system such that it is able to represent the training
sample competently, the system proceeds on to the next stage of the structural
learning phase. This is determined by the condition
$R_{k^*}(X(t), D(t)) > RuleGen$,
$k^* = \arg\max_{k \in \{1 \ldots K(t)\}} R_k(X(t), D(t))$, where the
activation of the rule $R_k$ by $[X(t); D(t)]$ is given as

$$R_k(X(t), D(t)) := \min\left\{\tfrac{1}{2}[\underline{f}_k + \overline{f}_k], \tfrac{1}{2}[\underline{f}^{back}_k + \overline{f}^{back}_k]\right\}$$

and $RuleGen$ is a pre-defined rule creation threshold. In this paper,
$RuleGen$ is fixed as a constant 0.6. Subsequently, the certainty factors for
the fuzzy rules are updated using (1) and the system moves on to the second
operation in the structural learning phase.

On the other hand, if none of the rules in the system is able to give a
satisfactory representation of the training data point, a new fuzzy rule $R'$
is created to encode the knowledge from the training sample. The system
proceeds by finding the best matched fuzzy labels $IL_{i,j_i^*}$ and
$OL_{l_m^*,m}$ to the data point $[X(t); D(t)]$, where

$$j_i^* = \arg\max_{j_i \in \{1 \ldots J_i(t)\}} \tfrac{1}{2}[\underline{f}_{i,j_i} + \overline{f}_{i,j_i}], \qquad l_m^* = \arg\max_{l_m \in \{1 \ldots L_m(t)\}} \tfrac{1}{2}[\underline{f}^{back}_{l_m,m} + \overline{f}^{back}_{l_m,m}].$$

Subsequently, each of the labels in the set of best matched fuzzy labels can
be categorised into three operations as follows:

1. No action is required for the best matched fuzzy label and it is defined as
part of the antecedent/consequent segment of the rule $R'$. This scenario
occurs when the match between the input value $x_i(t)$ (resp. output value
$d_m(t)$) and the corresponding best matched label $IL_{i,j_i^*}$ (resp.
$OL_{l_m^*,m}$) is highly similar, i.e.,
$\frac{1}{2}[\underline{f}_{i,j_i^*} + \overline{f}_{i,j_i^*}] > 0.75$ or
$\frac{1}{2}[\underline{f}^{back}_{l_m^*,m} + \overline{f}^{back}_{l_m^*,m}] > 0.75$.

2. No action is required for the best matched label and a new label is created
as part of the antecedent/consequent segment of the rule $R'$. This scenario
occurs when the similarity between the input-output value and its
corresponding best matched label is minimal, i.e.,
$\frac{1}{2}[\underline{f}_{i,j_i^*} + \overline{f}_{i,j_i^*}] < 0.25$ or
$\frac{1}{2}[\underline{f}^{back}_{l_m^*,m} + \overline{f}^{back}_{l_m^*,m}] < 0.25$.
A new fuzzy label $IL_{i,J_i(t+1)}$, $J_i(t+1) = J_i(t) + 1$, or
$OL_{L_m(t+1),m}$, $L_m(t+1) = L_m(t) + 1$, is created such that the centres of
the left and right functions of the new fuzzy label are set to be the
corresponding input-output value from the training vector.

3. The spread of the best matched fuzzy label is expanded and the expanded
label is defined as part of the antecedent/consequent segment of the rule
$R'$. This scenario occurs when the similarity between the input-output value
and its corresponding best matched label falls in the interval [0.25, 0.75],
i.e., the match is reasonable but not satisfactory. Subsequently, the best
matched label will expand itself by increasing the spread $s$ between the
centres of the left and right functions of the fuzzy label to incorporate the
current input-output value by
$s(t+1) = \min[s_{max}, s(t) + \eta_s \cdot s_{max}]$, such that $s_{max}$ is
the maximum permissible spread for each of the fuzzy labels.

Although a new fuzzy rule $R'$ has been created as described above, it will only
be included in the rulebase of the neural fuzzy system if it is not ambiguous and
the novelty of the fuzzy rule is ascertained.
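
The decision logic of the rule creation stage can be sketched as follows. This is a hedged illustration only: the function names and the returned strings are hypothetical, the match values are assumed to be the averaged interval similarities described above, and the rule creation threshold is passed in as a parameter rather than fixed.

```python
def needs_new_rule(rule_activations, rule_gen):
    # A new rule is created when no existing rule exceeds the threshold.
    return not rule_activations or max(rule_activations) <= rule_gen

def categorise_label(match, low=0.25, high=0.75):
    # Decide what to do with the best matched label in one dimension.
    if match > high:
        return "reuse"   # case 1: label is used as it is
    if match < low:
        return "create"  # case 2: a new label is created at the data value
    return "expand"      # case 3: spread of the label is expanded
```

For instance, `needs_new_rule([0.3, 0.5], 0.6)` is true because the best rule activation does not exceed the threshold, and a best matched label with similarity 0.5 falls into the third case.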


Merger of Highly Over-Lapping Fuzzy Labels: The second stage in the
structural learning phase of the proposed model is the merging of two highly
similar/over-lapping fuzzy labels in each of the input-output dimensions. If
the similarity measure between two interval Type-2 fuzzy labels $A_1$ and
$A_2$, $SM(A_1, A_2)$, is greater than a merger threshold, the fuzzy labels
$A_1$ and $A_2$ are merged such that

$$c_{L_1} = \frac{c_{L_1} + c_{L_2}}{2}, \qquad c_{R_1} = \frac{c_{R_1} + c_{R_2}}{2}, \qquad \{A\}(t+1) = \{A\}(t) \setminus A_2,$$

where $\{A\}(t)$ is the set of fuzzy labels in the corresponding input-output
dimension at time step $t$, and $c_{L_1}$ and $c_{R_1}$ are the centres of the
left and right functions of the Type-2 label $A_1$ respectively.

Deletion of Obsolete Rule: The final stage in the structural learning phase
of the proposed model is the deletion of any obsolete rules that are present
in the rulebase of the system at time step $t$. A fuzzy rule in the neural
fuzzy system is regarded as an invalid/out-dated rule if the certainty factor
of the rule falls below a threshold $RuleDel$, where $RuleDel$ represents the
minimum potential that a fuzzy rule should possess for it to be considered
able to model the current underlying dynamics of the application environment.
In this paper, $RuleDel$ is fixed as a constant 0.35. Subsequently, the number
of rules $K(t+1)$ is reduced accordingly.

The combination of the three operations within the framework of the proposed
eT2FIS model ensures that the neural fuzzy system maintains an up-to-date and
compact fuzzy rulebase that is able to model the current underlying dynamics
of the application. This is because a new rule is created when the new
training data point cannot be represented satisfactorily by the existing set
of fuzzy rules in the system, and obsolete rules are deleted when they are no
longer valid under the current application environment. In addition, highly
over-lapping/similar fuzzy labels in each of the input-output dimensions are
also merged to reduce the computational complexity of the neural fuzzy system,
and this helps to improve the overall interpretability of the system.
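
The merger and deletion stages can be sketched as follows. This is an illustrative sketch under stated assumptions: labels are represented as hypothetical (left centre, right centre) tuples, the merged centres are taken as the averages of the two original centres (one reading of the merge equations), and the similarity measure itself is left out because its form is not given in this section.

```python
def merge_labels(label_a, label_b):
    # Merged label: average the left centres and the right centres.
    return ((label_a[0] + label_b[0]) / 2.0,
            (label_a[1] + label_b[1]) / 2.0)

def prune_rules(certainties, rule_del=0.35):
    # Keep only the rules whose certainty factor has not fallen below RuleDel.
    return {rule: cer for rule, cer in certainties.items() if cer >= rule_del}
```

For example, merging the labels (0.0, 1.0) and (1.0, 2.0) gives (0.5, 1.5), and pruning a rulebase with certainty factors 0.9 and 0.2 at the threshold 0.35 leaves only the first rule.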

3.2  Parameter  Learning 

The second phase in the incremental sequential learning process of the
proposed model is parameter learning. After a newly arrived training data
sample $[X(t); D(t)]$ passes through the structural learning phase, it
activates parameter learning in the system, where the objective of the
parameter adaptation is to minimize the error between the computed output
$Y(t)$ and the desired output $D(t)$ at each time step $t$. The error function
at time $t$ is thus defined as $E = \frac{1}{2} \sum_{m=1}^{M} [d_m - y_m]^2$.
Parameter adaptation in the proposed eT2FIS is performed using a
neural-network based backpropagation mechanism.
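
As a small illustration of the quantity being minimized, the error function and its derivative with respect to the computed outputs can be sketched as follows. The function names are hypothetical, and the remaining steps of backpropagation through the fuzzy label centres are not shown because they are not detailed in this section.

```python
def squared_error(desired, computed):
    # E = 1/2 * sum over outputs of (d_m - y_m)^2
    return 0.5 * sum((d - y) ** 2 for d, y in zip(desired, computed))

def output_gradient(desired, computed):
    # dE/dy_m = -(d_m - y_m), the starting point of backpropagation.
    return [-(d - y) for d, y in zip(desired, computed)]
```

With desired outputs (1.0, 0.0) and computed outputs (0.5, 0.5), the error is 0.25 and the gradient is (-0.5, 0.5).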

4  Experimental  Results 

This section describes two experimental simulations performed by eT2FIS,
namely system identification of a time-varying plant and that of a plant with
noise.

4.1  System Identification of a Time-Varying Plant

To illustrate the abilities to evolve and adapt, the eT2FIS model is employed
to model the underlying characteristics of a time-varying plant. The initial
conditions $(u(0), y(0))$ are set as $(0, 0)$ and the objective of the
experiment is to identify the output $y(t+1)$ given the input vector
$(u(t), y(t))$ at each time step $t$. For the purpose of this experiment,
3000 data tuples are produced. The proposed system is employed to identify the
plant in an online sequential mode, i.e., there is no prior knowledge of the
plant and the training tuples are presented to the system individually at each
time step through a single pass.

Fig. 4(a) illustrates the performance of the eT2FIS model in the modeling of
the time-varying plant. Fig. 4(a)(i) shows the number of rules identified by the
proposed model during the online identification of the plant. The fluctuations
in the number of rules identified at the start of the experiment, at
$t = 1000$ (when a disturbance $f(t)$ is added) and at $t = 2000$ (when $f(t)$
is removed) indicate that the model is trying to learn the underlying
characteristics of the plant. Thereafter, the number of identified rules
stabilizes until further changes are detected in the underlying dynamics of
the environment. Fig. 4(a)(ii) shows the total number of rules identified for
the modeling of the time-varying plant achieved by eT2FIS and the benchmarking
models, namely the eFSM [14] and the SEIT2FNN [4] models. By adopting Type-2
sets in the system, the proposed model requires fewer rules (16) to model the
plant than the Type-1 Mamdani-type eFSM model (25). This translates to a
computationally less complex eT2FIS system. On the other hand, it is not
surprising that the SEIT2FNN model requires far fewer rules (4) than the
eT2FIS model because of the greater computational power of TSK-type systems.
Nevertheless, the proposed eT2FIS model is able to achieve a satisfactory
modeling performance as the dynamics of the plant change over time when
compared to the SEIT2FNN model, as seen from the online identification results
in Figs. 4(a)(iii) and (iv).

(a)(ii)
System      Type      # Rules
eFSM        T1-Mam.   25
SEIT2FNN    T2-TSK    4
eT2FIS      T2-Mam.   16

Fig. 4. (a) Experimental results for the time-varying plant with additive disturbance
f(t): (i) Number of rules K(t) identified by eT2FIS at each time instance, (ii) Total
number of identified rules obtained for eT2FIS and the benchmarking systems,
(iii) Online identification results by eT2FIS, and (iv) Online identification results by
SEIT2FNN [4]. (b) Illustrative results for the plant with noise: (i) A realization of the
identification results for the plant with noise, and (ii) Average online learning errors
for the benchmarking systems.

4.2  System  Identification  with  Noise 

To illustrate the noise resistance abilities of the proposed evolving Type-2
system, the eT2FIS is employed to identify the output $y(t+1)$ of a plant
driven by the input $u(t) = \sin(2\pi t/100)$, $t = 0 \ldots 1000$. Here, the
measured output $y(t+1)$ is assumed to be contaminated by noise. The added
noise is an artificially generated Gaussian white noise with variance 0.1.
There are 10 Monte Carlo realizations in this experiment. As in the previous
experiment, the computational structure of the eT2FIS is incrementally
formulated with the arrival of each training tuple.
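
The data generation for this experiment can be sketched as follows. This is a minimal sketch, not the authors' setup: the function names and the seeding scheme are hypothetical, only the input signal and the additive noise are shown, and the plant dynamics are deliberately left out, so the clean outputs must be supplied by the caller.

```python
import math
import random

def input_signal(t):
    # Sinusoidal input u(t) = sin(2 * pi * t / 100).
    return math.sin(2.0 * math.pi * t / 100.0)

def noisy_realisation(clean_outputs, variance=0.1, seed=None):
    # Add zero-mean Gaussian white noise with the given variance.
    rng = random.Random(seed)
    std = math.sqrt(variance)
    return [y + rng.gauss(0.0, std) for y in clean_outputs]

def monte_carlo(clean_outputs, runs=10, variance=0.1):
    # One independently seeded noisy copy of the outputs per realization.
    return [noisy_realisation(clean_outputs, variance, seed=run)
            for run in range(runs)]
```

Averaging the squared identification error over the realizations returned by `monte_carlo` gives the kind of average learning error that the comparison is based on.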

Fig. 4(b) shows the performance comparison between the eT2FIS model and
the Type-1 eT2FIS model (eT1FIS; see footnote 1). Fig. 4(b)(i) shows the
learning results of the benchmarking systems for one of the 10 realizations.
The computed outputs of both the eT2FIS and the eT1FIS models do not fluctuate
as violently as the actual noisy output, indicating that the neural fuzzy
systems possess the ability to model uncertainties in an application
environment. Fig. 4(b)(ii) shows the average learning errors over the 10
realizations for the benchmarking systems. Being a Type-2 system, the proposed
model is more resistant to the noise present in the underlying dynamics of the
environment, as seen by the significantly smaller average squared error
(calculated as an accumulation over 100 time steps) of the eT2FIS model as
compared to the Type-1 model. This means that while neural fuzzy systems are
able to incorporate the effects of uncertainties in the structures of the
systems, Type-2 systems are able to demonstrate a greater tolerance compared
to their Type-1 counterparts under a noisy application environment.


5  Conclusions 

This paper presents the eT2FIS model, an evolving Type-2 Mamdani-type neural
fuzzy inference system that is able to learn, evolve and adapt with the changes
in the environment that it is modeling. Encouraging performances have been
achieved when the system is employed to identify a plant with non-stationary
dynamics and a plant with noise.

1  The Type-1 eT2FIS model, denoted as eT1FIS, refers to a modified version of the
proposed eT2FIS model where the fuzzy labels in the antecedent/consequent layers
of the network are set as Type-1 fuzzy sets. The learning algorithm of the eT1FIS
is similar to that of the proposed model. The purpose of benchmarking against an
evolving Type-1 neural fuzzy system with similar learning mechanism is to illustrate
the greater uncertainty tolerance of a Type-2 system in a noisy environment when
compared to its Type-1 counterpart.
Abstract. In this paper, we propose a new multiple sensory fused human
identification model for providing human augmented cognition. In the proposed
model, facial features and mel-frequency cepstral coefficients (MFCCs) are
considered as visual features and auditory features, respectively, for
identifying a human. An adaboosting model then identifies a human using the
integrated sensory features of both visual and auditory features. Facial form
features are obtained from the principal component analysis (PCA) of a human's
face area localized by an Adaboost algorithm in conjunction with a skin color
preferable attention model. Moreover, MFCCs are extracted from human speech.
Thus, the proposed multiple sensory integration model aims to enhance the
performance of human identification by considering both visual and auditory
features working complementarily under partly distorted sensory environments.
A human augmented cognition system with the proposed human identification
model is implemented as a goggle-type device, on which it presents information
such as an unknown person's profile based on human identification.
Experimental results show that the proposed model can plausibly conduct human
identification in an indoor meeting situation.

Keywords:  human  augmented  cognition,  human  identification,  multiple  sensory 
integration  model,  visual  and  auditory,  adaptive  boosting,  selective  attention. 


1  Introduction 

Human augmented cognition is a topic in cognitive science that aims to extend a
user's abilities via computational technologies. A person, even if he or she is
not handicapped, has bottlenecks, limitations and biases in cognition, for
example, limitations in attention, memory, learning, comprehension,
visualization abilities, and decision making. The goal of human augmented
cognition research is to develop computational methods and tools to overcome
these problems and to improve human cognition abilities.

Over the last couple of decades, there has been a lot of interest in the design
and development of assistive devices that aim to provide people with visual
impairments with the ability to manipulate devices in their daily activities.
Most of these devices have focused on enhancing the interaction with machines
and environments of a user who is blind or visually impaired, for example in
dealing with a computer monitor, a personal digital assistant, or a cellular
phone, and in indicating road traffic signals [1, 2]. Although these efforts
are essential for the quality of life of visually handicapped people, such an
assistant system is also helpful for the purpose of augmented cognition,
allowing common people to enlarge their cognition ability when they confront
complex and distracting situations.

Thus, human augmented cognition systems such as visual and auditory assistance
systems have recently received more attention from many smart electronic
device communities [3]. In order to implement those assistive systems, human
identification technology is one of the important issues. In terms of human
identification, face detection and recognition have been researched
extensively in proportion to their importance [4, 5]. However, face
recognition research has utilized only the visual information of the face in
order to identify a human, and so suffers from various factors such as
illumination change, image affine transform, distortion, and occlusion in real
situations [4, 5]. Moreover, even though many researchers have proposed
speaker detection and recognition based only on auditory information, these
models also have difficulties caused by the various sound distortions that
occur in real complex environments [6].

To solve these problems, some researchers have proposed a combined visual-auditory
approach that considers both visual and auditory properties for human identity
recognition [7, 8]. However, these models consider the different sensory features in a
concatenating manner rather than in an integrating manner. Therefore, those human
identification systems do not consider associated features that may provide more
elaborate information for enhancing human recognition.

Thus, in this paper, we propose a new visual-auditory fused model that integrates
multiple sensory features to enhance human identification. To obtain visual-auditory
features, visual and auditory features are first extracted from the face and speech of a
person. Facial form features are extracted as visual features by principal component
analysis (PCA) of the facial area, which is localized by an Adaboost algorithm in
conjunction with a skin color preferable attention model [9-12]. MFCC features are
extracted as auditory features from the person's voice. The extracted visual and auditory
features are then integrated and used as input to a human identification model
implemented by a sensory fusion adaboosting model [13]. The proposed human identification
model is applied to a goggle-type human augmented cognition system, which provides
information such as an unknown person's profile through a goggle lens type screen.

This paper is organized as follows. Section 2 describes the proposed multiple sensory
integrated model for human identification. Section 3 presents the implemented goggle-based
human augmented system and the experimental results. Section 4 presents our conclusions
and discussion.

2  Proposed  Multiple  Sensory  Integration  Model

Fig. 1 shows the proposed multiple sensory integration model. Robust extraction of face
features from visual information requires robust face detection. In this paper, we use a
skin color preferable selective attention model to localize face candidates. The proposed
face detection method has a smaller computational time and a lower false positive
detection rate than the well-known Adaboost face detection algorithm.

To robustly localize candidate regions for faces, we build a skin color intensified
saliency map (SM), constructed by a selective attention model that reflects skin color
characteristics. Figure 1 shows the skin color preferable saliency map model, in which
red (r), green (g), and blue (b) color features are extracted from the input image. The
intensity feature is generated by integrating the skin color filtered red (r), green (g),
and blue (b) color features. The R-G color opponent feature is obtained from the red (r)
and green (g) color features, and the edge feature is generated from the R-G color
opponent feature. Then, the intensity, edge, and color opponent feature maps are
constructed by Gaussian pyramid processing and the CSD&N algorithms [10]. Finally, a face
color preferable SM is generated by integrating these three feature maps, from which the
face candidate regions are localized by applying a labeling based segmenting process
[11]. The localized face candidate regions are subsequently categorized as final face
candidates by the Haar-like form feature based Adaboost algorithm [11, 14]. In addition,
the visual features to be integrated with the auditory features are generated by
projecting the localized face area onto the selected principal components obtained from
principal component analysis (PCA) [12].

On the other hand, to extract low-level features from an input auditory signal, we use
the mel-frequency cepstral coefficients (MFCC) feature extraction method, which is
commonly used in the HMM Tool Kit (HTK) [15]. In the auditory feature extraction block,
the input signal is pre-emphasized through a first-order digital filter whose transfer
function is $1 - 0.97z^{-1}$, which makes the signal spectrally flatter. Then, the
non-stationary speech signal is windowed by a 25 ms Hamming window with a frame shift of
10 ms for applying the short-time Fourier transform (STFT). The magnitude spectrum of
each frame is weighted and summed to make 24 filterbank outputs according to the mel
scale. The outputs of the mel-scale filterbank are logarithmically scaled and converted
into 12th-order MFCCs by the discrete cosine transform (DCT). Then, these cepstral
coefficients are liftered for pragmatic reasons and packed together with the log energy
feature of the frame into a 13-dimensional feature vector.

Fig. 1. Proposed multiple sensory integrated model for human augmented cognition, with an
input image and sound as inputs; r: red, g: green, b: blue, Filter: skin color filter,
I: intensity, E: edge, RG: normalized red-green color opponent, CSD&N: center surround
difference and normalization algorithms, Adaboost: adaptive boost, STFT: short time
Fourier transform, Log: logarithm, DCT: discrete cosine transform, PCA: principal
components analysis. After CSD&N, I, E, and C denote the intensity, edge, and color
opponent feature maps.

Finally, human identification is conducted by a multiple sensory fusion adaboosting
model that uses the integrated facial form features (visual) and MFCC features
(auditory).

2.1  Visual  Feature  Extraction 

Face detection is one of the keys to enhancing the performance of human identification.
Even though conventional face detection models based on Adaboost algorithms show good
performance in real-time environments, they still suffer from a high false positive
detection rate and a heavy computational load in complex environments. To mitigate those
problems, we use a method for localizing face candidate regions that is based on the skin
color preferable attention model. The proposed method effectively reduces the region of
interest in a complex input visual scene.

To localize the face candidate areas, we consider the skin color filtered intensity, the
R-G color opponent, and its edge feature, which are used as inputs to the skin color
preferable attention model. Thus, after extracting the r, g, and b color features from
the input color image, the intensity and the normalized red (R) and green (G) color
features are extracted; normalization is known to reduce the influence of luminance, as
the human visual system does [11].

The skin color filtered intensity feature is extracted from pixels whose r, g, and b
values satisfy the dedicated ranges given by the rules in Eq. (1) [11].

$$\begin{aligned}
& r > 95,\; g > 40,\; b > 20 \;\text{ and} \\
& \max\{r, g, b\} - \min\{r, g, b\} > 15 \;\text{ and} \\
& |r - g| > 15 \;\text{ and }\; r > g \;\text{ and }\; r > b
\end{aligned} \qquad (1)$$
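The rules of Eq. (1) translate directly into a per-pixel test. The following is a minimal sketch, assuming 8-bit r, g, b values; the function names are hypothetical, and taking the intensity as the mean of the three channels is an assumption, since the text does not spell out how the filtered channels are integrated.

```python
def is_skin_pixel(r, g, b):
    """Return True if an 8-bit (r, g, b) pixel satisfies the rules of Eq. (1)."""
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def skin_filtered_intensity(pixels):
    """Intensity of skin pixels, 0 elsewhere.

    Assumption: intensity is the mean of r, g, and b.
    """
    return [
        (r + g + b) / 3.0 if is_skin_pixel(r, g, b) else 0.0
        for r, g, b in pixels
    ]
```

For example, a reddish pixel such as (210, 120, 90) passes the filter, while a gray pixel such as (100, 100, 100) fails the max-min spread rule.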

Previous work has shown that the R-G color opponent feature discriminates between face
and non-face areas more robustly than other color opponent features [9-11]. Therefore,
the R-G color opponent feature is considered one of the face color preferable features.
To enhance the edge magnitude of candidate face areas in a complex scene, we also
consider the edge of the R-G color opponent feature, which is constructed by Eq. (2); the
Sobel operator is applied as the edge operator [16].

$$RG = |R - G| \qquad (2)$$

Then, we apply the on-center and off-surround operation using Gaussian pyramid images
with different scales from level 0 to level n, where the n-th level is made by
sub-sampling by a factor of $2^n$; this constructs three feature bases: intensity (I),
edge (E), and color (RG). The center-surround features are then constructed by the
difference operation between the fine and coarse scales in the Gaussian pyramid images
[11]. Consequently, the three feature maps I, E, and C, which stand for intensity, edge,
and R-G color opponency, are obtained by the center-surround difference algorithm [11].

A  saliency  map  (SM)  is  constructed  by  the  normalized  summation  of  those  three 
feature  maps  as  shown  in  Eq.  (3). 


$$SM = \mathrm{Norm}(I + E + C) \qquad (3)$$
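The normalized summation of Eq. (3) can be sketched as follows. This is an illustration under stated assumptions: the three feature maps are given as equally sized lists of rows, and Norm is taken to be min-max scaling to [0, 1], which the text does not specify. The function names are hypothetical.

```python
def normalize(feature_map):
    """Min-max scale a 2-D map (list of rows) to [0, 1]; assumed form of Norm."""
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no saliency information.
        return [[0.0 for _ in row] for row in feature_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in feature_map]


def saliency_map(intensity, edge, color):
    """SM = Norm(I + E + C), computed element-wise."""
    summed = [
        [i + e + c for i, e, c in zip(row_i, row_e, row_c)]
        for row_i, row_e, row_c in zip(intensity, edge, color)
    ]
    return normalize(summed)
```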


After the face candidate areas are segmented by a labeling process on the binarized
saliency map, which is obtained with Otsu's threshold, the localized face candidate areas
are used as input to the Adaboost algorithm for verifying the face regions [16, 14].
Finally, PCA extracts facial features for recognizing human faces [12].
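Otsu's threshold picks the gray level that maximizes the between-class variance of the two resulting pixel groups. The sketch below assumes the saliency map has already been quantized to integer levels 0 to 255 and flattened into a list; the function names are hypothetical and the labeling step is not shown.

```python
def otsu_threshold(values, levels=256):
    """Return the level t that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the lower class yet
        if w1 == 0:
            break     # no pixels left for the upper class
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(values, threshold):
    """Map each value to 1 (salient) if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]
```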

2.2  Auditory  Feature  Extraction 

Auditory features contained in a speech signal can be categorized into three kinds:
linguistic message, speaker information, and acoustic channel characteristics. To extract
only features that reinforce human recognition, we need to separate speaker information
from the others as much as possible. According to the speech synthesis model, linear
prediction (LP) coefficients can be good features that represent the vocal tract
excitation well, which is valuable speaker information. In practice, however, the MFCC
feature extraction method works well in adverse conditions as well as in normal
conditions. Moreover, MFCC features perform well in both speaker recognition and speech
recognition systems with less computational complexity, so we use MFCC features for human
identification [6].

MFCC features are usually combined with additional features such as the first- and
second-order delta features, which reduce the word recognition error. In the proposed
multiple sensory integration model, however, these additional delta coefficients are not
only insignificant for speaker discrimination from a perceptual point of view but also
likely to worsen recognition results through the well-known 'curse of dimensionality'
[17]. Hence, we do not use first- and second-order delta coefficients.

Input sound samples are pre-emphasized through a first-order digital filter whose
transfer function is $1 - 0.97z^{-1}$. This filter has an almost linearly increasing
frequency response, so speech signals become spectrally flatter after filtering. Then,
using the Hamming window given by Eq. (4), speech signals are split into 25 ms short-time
segments called frames, which are windowed at every 10 ms time advance.


$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1 \qquad (4)$$
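The pre-emphasis and framing steps can be sketched as follows. The Hamming window is written in its standard form, w(n) = 0.54 - 0.46 cos(2πn/(N-1)), which is assumed here to be what Eq. (4) denotes; the 16 kHz sampling rate is taken from the experimental setup in Section 3.2, and the function names are hypothetical.

```python
import math


def pre_emphasize(x, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1], i.e. the filter 1 - 0.97 z^-1."""
    if not x:
        return []
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def hamming(num_samples):
    """Standard Hamming window of the given length."""
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
        for n in range(num_samples)
    ]


def split_into_frames(x, sample_rate=16000, width_ms=25, shift_ms=10):
    """Cut x into Hamming-windowed frames of width_ms, advancing by shift_ms."""
    width = sample_rate * width_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000  # 160 samples at 16 kHz
    window = hamming(width)
    return [
        [x[start + n] * window[n] for n in range(width)]
        for start in range(0, len(x) - width + 1, shift)
    ]
```

With these settings, one second of audio (16000 samples) yields 98 complete frames of 400 samples each; trailing samples that do not fill a frame are dropped in this sketch.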


Each frame is zero-padded, and the short-time Fourier transform (STFT) is taken to obtain
its spectrum. The magnitude values of the spectrum are weighted by 24 triangular windows,
each centered at a mel-scale frequency point and half-overlapping with the adjacent
bands, and summed to make 24 band outputs. A typical mel scale is given by Eq. (5).


$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (5)$$
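As an illustration of how the 24 triangular bands can be placed, the sketch below uses the commonly used mel-scale formula Mel(f) = 2595 log10(1 + f/700), assumed here to be the form Eq. (5) refers to, and spaces the band edges equally on the mel axis up to the 8 kHz speech bandwidth mentioned in Section 3.2. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel (assumed form of Eq. (5))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def band_edges(num_bands=24, f_max_hz=8000.0):
    """Return num_bands + 2 frequencies equally spaced on the mel scale.

    Triangular band k (1-based) rises from edges[k-1] to its center
    edges[k] and falls to edges[k+1], so adjacent bands half-overlap.
    """
    mel_max = hz_to_mel(f_max_hz)
    return [
        mel_to_hz(mel_max * i / (num_bands + 1))
        for i in range(num_bands + 2)
    ]
```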

From the above procedure, the k-th band output at frame t, which is scaled
logarithmically to produce a log-spectral parameter [18], is represented as Eq. (6),
where $s_k$ and $e_k$ denote the start and end points of the k-th band, respectively.

$$O_k(t) = \log\left(\sum_{f=s_k}^{e_k} \omega_k(f)\,|S_\omega(f, t)|\right) \qquad (6)$$

$f$ denotes the frequency index, $|S_\omega(f, t)|$ represents the magnitude spectral
value at $f$ and $t$, and $\omega_k(f)$ is a weight function corresponding to the
triangular window of the k-th band.

Then, MFCCs are obtained by applying the DCT to the log-spectral parameters. The i-th
coefficient of the M-order MFCCs at frame t, $c_i(t)$, is expressed as Eq. (7) [19],
where K denotes the number of mel-scaled bands.

$$c_i(t) = \sum_{k=1}^{K} O_k(t)\cos\left(\frac{i(k - 0.5)\pi}{K}\right) \qquad (7)$$
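The DCT step can be written in a few lines. This is a minimal sketch, assuming Eq. (7) has the unscaled form c_i(t) = Σ_k O_k(t) cos(i(k - 0.5)π/K) with k running from 1 to K; the function name is hypothetical.

```python
import math


def mfcc_from_log_bands(log_bands, num_coeffs=12):
    """Convert K log filterbank outputs of one frame into MFCCs c_1..c_M."""
    num_bands = len(log_bands)  # K, 24 in the paper
    return [
        sum(
            log_bands[k - 1] * math.cos(i * (k - 0.5) * math.pi / num_bands)
            for k in range(1, num_bands + 1)
        )
        for i in range(1, num_coeffs + 1)
    ]
```

A flat log spectrum (all band outputs equal) gives coefficients that are zero up to rounding error, which reflects the fact that c_1 and higher capture only the shape of the spectral envelope, not its overall level.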


The principal advantage of the cepstral coefficients is that they are generally
decorrelated, which allows a diagonal covariance to be used in a classifier. One minor
problem is that higher-order cepstra are numerically very small, which results in related
parameters such as covariances having a wide range. This does not actually affect the
performance of a classifier, but for pragmatic reasons such as storing data in limited
precision and displaying parameters, we re-scale the cepstral coefficients to have
similar magnitudes according to Eq. (8), where L denotes the liftering parameter.

$$c'_i(t) = \left(1 + \frac{L}{2}\sin\frac{\pi i}{L}\right) c_i(t) \qquad (8)$$
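The re-scaling of Eq. (8) is a per-coefficient gain. The sketch below assumes the form c'_i = (1 + (L/2) sin(πi/L)) c_i; the default L = 22 is an assumption borrowed from common HTK configurations, since the text does not state the value used. The function name is hypothetical.

```python
import math


def lifter(cepstra, lifter_param=22):
    """Re-scale cepstral coefficients c_1..c_M of one frame by Eq. (8)."""
    return [
        (1.0 + (lifter_param / 2.0) * math.sin(math.pi * i / lifter_param)) * c
        for i, c in enumerate(cepstra, start=1)
    ]
```

With L = 22 the gain grows from about 2.6 at i = 1 to 12 at i = 11, which boosts the numerically small higher-order coefficients toward the magnitude of the lower-order ones.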


2.3  Adaptive  Boosting  for  People  Recognition


As mentioned above, we obtain visual features from each face image using the PCA
algorithm and extract auditory features by the MFCC feature extraction method. After
feature extraction, we need to construct a classifier to identify the person. In the
proposed multiple sensory integration model, we use the Adaboost algorithm to integrate
visual and auditory features and identify a person. The adaptive boosting algorithm
integrates the results of weak classifiers by weighted voting to construct a strong
classifier, as shown in Eq. (9) [13].

$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \qquad (9)$$

where $x$ is the input, $H$ is the final hypothesis, $\alpha_t$ is the weight for the
t-th weak classifier, $h_t$ is the t-th weak classifier, and $T$ is the number of weak
classifiers.

The classification result of each simple classifier, $h_t(x)$, has the value of 1 or -1,
and the weight $\alpha_t$ is calculated by Eq. (10).

$$\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \qquad (10)$$
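The weighted vote of Eq. (9) and the classifier weight can be sketched as follows. The weight formula is the standard AdaBoost one, α = 0.5 ln((1 - ε)/ε), assumed here to be what Eq. (10) denotes; the function names are hypothetical, and a tie in the vote is resolved to +1 by assumption.

```python
import math


def classifier_weight(error_rate):
    """Standard AdaBoost weight for a weak classifier with 0 < error_rate < 1."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)


def strong_classify(x, weak_classifiers, weights):
    """Return sign of the weighted vote of weak classifiers (each gives +1 or -1)."""
    score = sum(a * h(x) for h, a in zip(weak_classifiers, weights))
    return 1 if score >= 0 else -1
```

A weak classifier with a 50% error rate receives weight 0 and therefore has no influence on the vote, while one with a 10% error rate receives a weight of about 1.1.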


The weight becomes greater as the error rate $\epsilon_t$ becomes smaller; the more
accurate a classifier is, the higher its weight. Building a strong classifier $H(x)$
requires many weak classifiers. We need 200 weak classifiers since the sizes of the
extracted features are 160 and 130 for visual and auditory, respectively. The length of
the raw auditory feature sequence varies with the length of the utterance: if someone
pronounces a word slowly, the feature sequence is long, and if quickly, it is short. We
therefore need to bring the sequences to the same length. To do so, we divide the
auditory features into 10 sections and use the average value of each section. We finally
obtain 130 auditory features because each section consists of 13 values.
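The length normalization described above can be sketched as follows. It assumes the input is a list of 13-dimensional frame vectors with at least 10 frames; how section boundaries are chosen when the frame count is not divisible by 10 is not stated in the text, so the integer split used here is an assumption, and the function name is hypothetical.

```python
def fixed_length_features(frames, num_sections=10):
    """Average a variable-length sequence of frame vectors per section.

    Returns num_sections * dim values (130 for 13-dimensional frames).
    """
    num_frames = len(frames)
    if num_frames < num_sections:
        raise ValueError("need at least one frame per section")
    dim = len(frames[0])
    features = []
    for s in range(num_sections):
        start = s * num_frames // num_sections
        end = (s + 1) * num_frames // num_sections
        section = frames[start:end]
        # Mean of each of the dim coefficients over the frames of this section.
        features.extend(
            sum(frame[d] for frame in section) / len(section)
            for d in range(dim)
        )
    return features
```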

According to the Adaboost algorithm, a weak classifier is accepted if its error rate is
below 50% [13]. We therefore build each weak classifier with a single threshold. The
initial threshold is set using the median value of the feature in the positive group and
the average value of the feature in the negative group. Next, we vary the threshold in
the range from -50% to +50% in steps of 1%. The threshold with the minimum error rate is
chosen as the final threshold. Every weight and threshold is calculated in the learning
process, and the recognition process is performed using these parameters.
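One possible reading of this threshold search is sketched below for a single feature. Several details are assumptions, because the text leaves them open: the initial threshold is taken as the midpoint between the positive-group median and the negative-group mean, the ±50% sweep is applied relative to that initial value, and the error is an unweighted count, whereas a full AdaBoost round would weight each sample. The function name is hypothetical.

```python
import statistics


def train_stump(pos_values, neg_values):
    """Search a single threshold on one feature; return (threshold, polarity, error)."""
    pos_median = statistics.median(pos_values)
    neg_mean = statistics.mean(neg_values)
    base = (pos_median + neg_mean) / 2.0          # assumed initial threshold
    polarity = 1 if pos_median >= neg_mean else -1

    best_threshold, best_error = base, 1.0
    for pct in range(-50, 51):                    # -50% .. +50% in 1% steps
        threshold = base * (1.0 + pct / 100.0)
        # A sample is predicted positive when polarity * (value - threshold) > 0.
        errors = sum(1 for v in pos_values if polarity * (v - threshold) <= 0)
        errors += sum(1 for v in neg_values if polarity * (v - threshold) > 0)
        error = errors / (len(pos_values) + len(neg_values))
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, polarity, best_error
```

A stump whose returned error is 0.5 or more would be rejected under the acceptance rule stated above.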


3  Experimental  Results 

3.1  Hardware  Platform  and  Scenario  for  Experiment 

For the experiment, we developed a goggle-based human augmented system that supports the
user by presenting information about unknown or forgotten participants on the screen in
meeting and conference situations. As shown in Fig. 2, the system is equipped with 2
pinhole CCD cameras with 2 Microrobot USB image grabbers for recognizing the scene and
the user's gaze, and 2 TCM100 microphones with a Terra Tech 6fire USB amplifier for
localizing and recognizing the auditory source signal. We designed a meeting scenario to
demonstrate the performance of the proposed system, consisting sequentially of the
entrance and introduction of participants, the introduction of participants after being
seated, discussion and presentation, and the decision and conclusion of the meeting. We
then recorded 14 videos under different illumination conditions varying from 140 to 412
lux with three people participating in a meeting, from which we obtained the visual and
auditory database for experiments and verification of the developed system under the
given scenario.

Specification
(1) Camera (2 units)
    - Sony color CCD
    - 4.3 mm super cone pinhole lens
    - Size: 12 × 12 mm
(2) Grabber (2 units)
    - My Vision USB Dx
(3) Amplifier (AMP) and microphone (Mic)
    - Mic: Audio stream TCM100 (2 units)
    - AMP: Terra Tech's DMX 6fire USB
(4) Recording software tool environment
    - Language: Visual C++ 6.0 (MFC)
    - Library: vfw, Grabber Lib
    - Codec: Microsoft Video 1

Fig. 2. Developed goggle based augmented cognition platform based on visual and auditory
sensory integrated human identification

3.2  Experimental  Results 

In the face detection experiment, we captured 420 images from the 14 videos, covering
the introduction of participants when entering and the introduction of participants after
being seated in the meeting place.

For speech acquisition, the sampling frequency of the A/D converter for the input speech
signal is typically 16 kHz, and the A/D precision is 16 bits. In practice, a 16 kHz
sampling rate is sufficient for the speech bandwidth (8 kHz). It has been empirically
verified that the word recognition error stops decreasing when the sampling rate is
increased beyond 16 kHz [17]. Therefore, the A/D converter parameters are set to a 16 kHz
sampling rate and 16-bit precision. In frequency analysis, we set the short-time segment
width to 25 ms as a compromise between the stationarity assumption and the frequency
resolution. We also set the frame shift to 10 ms, which is typically used. With these
parameters, a 25 ms segment at a 16 kHz sampling rate corresponds to 400 samples per
frame, so 512 FFT points (the next power of two) are used to generate a spectrum. The
number of channels in the mel-scale filterbank is 24 because too many channels may cause
unwanted fluctuation in the spectral envelope, and too few channels may smooth out
details [20]. Finally, the dimensionality of the parametric vector is reduced to 12 by
taking the DCT to convert the log-spectral parameters into MFCCs. These features form a
13th-order feature vector together with an additional log-energy feature for the current
frame. As a result, a 13th-order auditory feature vector is supplied to the adaptive
boosting module every 10 ms.
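The frame arithmetic in this paragraph can be checked with a short calculation. The sketch below assumes the FFT size is chosen as the smallest power of two that holds one frame, which matches the 400-sample and 512-point figures in the text; the function name is hypothetical.

```python
def frame_parameters(sample_rate_hz=16000, width_ms=25, shift_ms=10):
    """Return (samples per frame, FFT size, feature vectors per second)."""
    samples_per_frame = sample_rate_hz * width_ms // 1000
    fft_size = 1
    while fft_size < samples_per_frame:
        fft_size *= 2  # smallest power of two that holds one frame
    vectors_per_second = 1000 // shift_ms
    return samples_per_frame, fft_size, vectors_per_second
```

For the settings above this gives 400 samples per frame, a 512-point FFT (so each frame is zero-padded with 112 samples), and 100 feature vectors per second.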

Moreover, to evaluate our multiple sensory integration model, we use 54 multiple sensory
data items, consisting of 54 face images and 54 speech samples, obtained from the
14-video database for three people. Half of the dataset is used for training and the
other half for evaluation. The adaboosting model for human identification is based on a
two-class classifier. Thus, among the 27 data items for three persons, 9 items for one
person are used as the positive set and 18 items for the other two persons are used as
the negative set, and this classification experiment is repeated 3 times, changing the
positive and negative datasets in a combinatorial manner.

Fig. 3 shows the experimental results for every step of localizing face areas in the
proposed face preferable selective attention model. Because the proposed face preferable
selective attention model considers the skin color filtered intensity, the R-G color
opponent, and the edge of the R-G color opponent, face color regions are naturally more
intensified than the background in the saliency map. Therefore, candidate face areas are
simply localized through a labeling based segmenting process with a size constraint in
the binarized saliency map. As a result, this processing reduces the computational load
and the false positive detection rate of the Adaboost face detection model, as shown in
Fig. 3. Table 1 also shows the performance of the proposed face detection on our face
database. Even though the correct detection rate of the proposed model is slightly lower
than that of the conventional Adaboost face detector in varying illumination
environments, the proposed model shows a better false positive detection rate, as shown
in Table 1. Moreover, in previous work, we have shown that the proposed face detection
model not only performs well but also improves the computational load and the false
positive detection rate on an open database [11].


Fig.  3.  Experimental  results  of  the  proposed  face  detection  model 

Fig. 4 shows the human identification performance of the proposed multiple sensory
integration model. Through this experiment, we aimed to show that human identification
performance can be enhanced considerably by considering sound information together with
visual information rather than visual information alone. As shown in Fig. 4, when we
considered only visual information, the performance was as low as 79.0%. However, the
performance rose to 98.9% when both sound and visual information were considered, which
shows that the proposed multiple sensory integration model can plausibly enhance human
identification performance. On the other hand, in this experiment we obtained 100% human
identification performance when we considered only sound information, which was caused by
using only clear sound data in the experiments. The image database, in contrast, was
obtained under varying illumination conditions. Thus, we have verified the performance of
the proposed multiple sensory integration model by showing that human identification
performance can be enhanced by additionally considering sound features together with
visual features in an integrated manner.

Table 1. Comparison of face detection performance

                  Proposed detection model    Conventional Adaboost
True positive     96.2% (404/420)             98.3% (413/420)
False positive    4.5% (19/423)               11.2% (52/465)
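
The denominators in Table 1 indicate how the two rates are computed: the true positive rate is relative to the 420 face images, while the false positive rate is relative to all detections, since 404 + 19 = 423 and 413 + 52 = 465. A short check of the arithmetic, with a hypothetical function name:

```python
def detection_rates(true_pos, false_pos, num_faces):
    """Return (true positive %, false positive %) rounded to one decimal.

    The false positive rate is taken relative to all detections
    (true positives + false positives), as the table's denominators suggest.
    """
    tp_rate = 100.0 * true_pos / num_faces
    fp_rate = 100.0 * false_pos / (true_pos + false_pos)
    return round(tp_rate, 1), round(fp_rate, 1)
```

Calling `detection_rates(404, 19, 420)` and `detection_rates(413, 52, 420)` reproduces the percentages reported for the proposed model and for the conventional Adaboost detector, respectively.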

Fig. 4. Performance of the multiple sensory integration based human identification model


4  Conclusion 

We proposed an adaptive boosting based multiple sensory integration model for human
identification that combines visual and auditory features. To extract visual features
robustly, we used not only the face preferable selective attention model, which improves
the computational load and false positive detection rate of the Adaboost face detector,
but also a PCA approach for extracting proper facial features and reducing their
dimensionality. In addition, we used the well-known MFCC to extract auditory features
robustly.

Even though the multiple sensory integration based human identification model has shown
plausible performance in experiments using our database, which reflects a low
illumination environment, it seems hard to deploy the proposed model in a real system
since it lacks an incremental mechanism for online feature extraction and learning.
Therefore, as further work, we are considering an online incremental learning concept for
developing a more advanced human augmented cognition system. Moreover, we are considering
more experiments using noisy sound data to verify the performance of the proposed model.

Abstract. This paper proposes a method of simultaneous localization and mapping based on
computational intelligence for a robot partner in unknown environments. First, we propose
a method of topological map building based on a growing neural network. Next, we propose
a method of localization based on a steady-state genetic algorithm. Finally, we discuss
the effectiveness of the proposed methods through several experimental results.

Keywords: Simultaneous Localization and Mapping, Informationally Structured Space,
Mobile Robots, Neural Networks, Genetic Algorithm.


1  Introduction 

Recently, robots have become familiar to people, and we expect human-friendly robots to
co-exist in human living environments. Such a robot needs various capabilities such as
learning, inference, and prediction for human interaction, and these capabilities are
interconnected with each other in the total system. In previous work, multi-strategic
learning has been discussed as a way to integrate multiple inference types and/or
computational mechanisms in one learning system [1], e.g., integration of symbolic and
numerical learning, hybrid computation over discrete and continuous spaces, integration
of stochastic search and deterministic heuristic search, and others. A multi-strategic
approach to path planning and behavioral learning [2], reinforcement learning based on
value iteration and policy iteration, and others have been proposed in the field of
intelligent robotics.

A human-friendly robot should have an environmental map for co-existing with people, but
it is very difficult to build the environmental map beforehand. The important functions
are to build up an environmental map and to estimate and correct the self-location. The
robot builds up an environmental map according to the position and posture of the robot,
while the robot estimates the position and posture according to the built environmental
map. This is a mutually nested structure, and is well known as simultaneous localization
and mapping (SLAM) [3-10]. SLAM is also considered one of the multi-strategic learning
methods. Map building by mobile robots has a long history [11-18]. There are two main
approaches: the metric approach and the topological approach. In the metric approach, an
environment is represented by a finite discrete space or a set of polygons. For example,
in a cell decomposition method, a two-dimensional workspace is often divided into M×N
rectangular cells. In the topological approach, an environment is represented by a list
of connectivities of places. Skeletonization methods directly generate intermediate
points and paths, while the cell decomposition methods generate collision-free space. In
the skeletonization methods, collision-free paths are basically generated according to
polygonal objects approximated in a workspace. A visibility graph consists of edges
connecting visible pairs of vertices of the polygonal objects. In the visibility graph,
the shortest path between two points can be generated easily by selecting edges. However,
it is dangerous for a mobile robot to move along the generated path, because the path is
adjacent to the vertices of the polygonal objects. To overcome this problem, a Maklink
graph can be used to generate a safe path. This method can be considered one of the
approximated Voronoi diagrams. In the Maklink graph, a candidate point is represented as
the middle point between two vertices, and a path is generated by connecting some
intermediate points. Although the generated path is safe, it might not be the shortest.

Next, we explain the background of the localization method for mobile robots.
Kalman filters have been applied to localization in the case of small and
incremental dead-reckoning errors, and multi-hypothesis Kalman filters have
been applied to localization based on beliefs represented by a mixture of
Gaussians. Furthermore, Monte Carlo localization has been applied to
localization [10]. Monte Carlo localization represents the belief by a set of
samples called particles, and this method is known as a particle filter. The
particle filter is one of the non-parametric Bayesian filtering methods. It
approximately represents the posterior by a random collection of weighted
particles drawn from the desired distribution. As the number of samples
becomes very large, this Monte Carlo characterization becomes equivalent to
the usual functional description of the posterior probability density
function, and the sequential importance sampling filter can approach the
optimal Bayesian estimate. However, the particle filter requires much
computational time and cost.
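To make the sample-based belief concrete, the following is a minimal sketch of one particle filter cycle for a one-dimensional position. It is an illustration only and not the filter used in the cited work: the function names, the Gaussian motion and measurement models, and the noise parameters are all assumptions chosen for brevity.

```python
import math
import random


def particle_filter_step(particles, control, measurement, motion_std, meas_std, rng):
    """One predict / weight / resample cycle for a 1-D position belief."""
    # Predict: shift every particle by the commanded displacement plus noise.
    moved = [p + control + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each particle.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in moved]
    if sum(weights) == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0] * len(moved)
    # Resample particles in proportion to their weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def estimate(particles):
    """Posterior mean approximated by the particle average."""
    return sum(particles) / len(particles)
```

The cost grows linearly with the number of particles per cycle, which is the computational burden the paragraph above refers to.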

In our previous work [19], we used image processing to estimate the
self-location of the robot based on the cell decomposition method, but it is
very difficult to deal with environmental lighting conditions. Furthermore,
the accuracy of the map building depends strongly on the granularity of the
map. Therefore, we proposed a topological map building method based on a
growing neural network as a topological approach [4]. A growing neural network
can add neurons and their connections to the network. Furthermore, we applied
a steady-state GA (SSGA) to update the estimated self-position of the robot by
using the measured distance and the topological map [4]. The proposed method
was applied to SLAM in a city hall, a parking area, and a university
cafeteria, and we compared the experimental result of the proposed method with
that of a particle filter [21]. However, the developed mobile robot is too
large to use as a robot partner at home. Therefore, in this paper we develop a
small robot partner and apply the proposed method to a living room. We then
discuss the effectiveness of the proposed method through experimental results.

This paper is organized as follows. Section 2 explains the hardware and control
method of the mobile robot. Section 3 proposes a method for topological map
building based on a growing neural network and a steady-state genetic
algorithm. Section 4 shows that the robot can perform simultaneous
localization and mapping by using the proposed method.


2  SLAM  for  Informationally  Structured  Space 

2.1  Informationally  Structured  Space 

Recently, as the population of elderly people increases, various types of
remote monitoring systems have been developed to detect emergencies involving
elderly people living alone. The introduction of coexisting human-friendly
robot partners is one possible way to realize the remote observation of
elderly people. Wireless sensor networks make it possible to gather a huge
amount of data on environments for remote monitoring. However, it is very
difficult to store all of this data in real time. Furthermore, some features
should be extracted from the gathered data to obtain the required information.
Access to environmental information is essential for both people and robots.
Therefore, the environment surrounding people and robots should have a
structured platform for gathering, storing, transforming, and providing
information. Such an environment is called an informationally structured space
(Fig. 1). The structuralization of the informationally structured space
enables quick update of and access to valuable and useful information for
users. If the robot can share the environmental information with people, the
communication with people might become very smooth and natural.


Fig.  1.  The  concept  of  informationally  structured  space 


Fig.  2.  An  example  of  SLAM 

A robot partner needs at least the intelligent capabilities of environmental
map building and physical behavioral learning in the real world. The map
building of the coexisting environment is performed by SLAM. Figure 2 shows an
environment and its corresponding map as a result of SLAM. If a robot partner
performs SLAM automatically, remote monitoring becomes easy to perform.

2.2  A  Robot  Partner:  MOBiMac 

We developed a partner robot, MOBiMac, shown in Fig. 3. Two CPUs are used for
the interaction with a human and the control of the robotic behaviors. The
robot has two servo motors, eight ultrasonic sensors, a laser range finder
(LRF), and a CCD camera. An ultrasonic sensor can measure the distance to
objects. The LRF can measure distances up to approximately 4,095 mm in 682
different directions, covering a measurement range of 240°. Therefore, the
robot can take various actions such as collision avoidance, human tracking,
and line tracing. The behavior modes of this robot are human detection, human
communication, behavior learning, and behavioral interaction. The
communication with a person is performed by utterances as the result of voice
recognition and by gestures as the result of human motion recognition.


Fig.  3.  Robot  Partner;  MOBiMac 

Various intelligent methods for mobile robots have been proposed, such as
production rules, Bayesian networks, neural networks, fuzzy inference systems,
and classifier systems. We have applied fuzzy inference systems to represent
the behavior rules of mobile robots, because the behavioral rules can be
designed easily and intuitively by human linguistic representations. A
behavior of the robot can be represented using fuzzy rules based on simplified
fuzzy inference. In general, a fuzzy if-then rule is described as follows.

If $x_1$ is $A_{i,1}$ and ... and $x_M$ is $A_{i,M}$

Then $y_1$ is $w_{i,1}$ and ... and $y_N$ is $w_{i,N}$

where $A_{i,j}$ and $w_{i,k}$ are the Gaussian membership function for the j-th
input and the singleton for the k-th output of the i-th rule; M and N are the
numbers of inputs and outputs, respectively. Fuzzy inference is performed by
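
$$ \mu_{A_{i,j}}(x_j) = \exp\left( -\frac{(x_j - a_{i,j})^2}{b_{i,j}^2} \right) \qquad (1) $$

$$ \mu_i = \prod_{j=1}^{M} \mu_{A_{i,j}}(x_j) \qquad (2) $$

$$ y_k = \frac{\sum_{i=1}^{R} \mu_i \, w_{i,k}}{\sum_{i=1}^{R} \mu_i} \qquad (3) $$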


where $a_{i,j}$ and $b_{i,j}$ are the central value and the width of the
membership function $A_{i,j}$, and R is the number of rules. The outputs of
the robot are the output levels of the left and right motors (N = 2). The
fuzzy controller is used for the collision avoidance and target tracing
behaviors. The inputs to the fuzzy controller for collision avoidance are the
distances to obstacles measured by the LRF (M = 8). The number of LRF
directions is reduced to 8 by choosing the minimal distance in each sensing
range. The inputs to the fuzzy controller for target tracing are the estimated
distance to the target point and the relative angle to the target point from
the moving direction (M = 2).
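The simplified fuzzy inference of Eqs. (1)-(3) can be sketched in a few lines. This is a hedged illustration rather than the controller running on the robot: the function name `fuzzy_infer`, the list-of-lists layout of the rule parameters, and the use of the squared width in the exponent are assumptions.

```python
import math


def fuzzy_infer(x, centers, widths, singletons):
    """Simplified fuzzy inference.

    x          : M input values
    centers    : R x M central values of the Gaussian membership functions
    widths     : R x M widths of the membership functions
    singletons : R x N singleton outputs
    Assumes at least one rule fires with non-zero strength.
    """
    strengths = []
    for a_i, b_i in zip(centers, widths):
        mu = 1.0
        for x_j, a_ij, b_ij in zip(x, a_i, b_i):
            # Gaussian membership grade of input j for this rule.
            mu *= math.exp(-((x_j - a_ij) ** 2) / (b_ij ** 2))
        strengths.append(mu)  # product over inputs = firing strength
    total = sum(strengths)
    n_out = len(singletons[0])
    # Firing-strength-weighted average of the singleton outputs.
    return [
        sum(m * w[k] for m, w in zip(strengths, singletons)) / total
        for k in range(n_out)
    ]
```

With two rules centered at 0 and 1 and an input halfway between them, both rules fire equally and each output is the plain average of the two singletons.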

In general, a mobile robot has a set of behaviors for achieving various
objectives, and must integrate these behaviors according to the environmental
conditions. Therefore, we proposed a method for multi-objective behavior
coordination. The multi-objective behavior coordination can integrate the
outputs of several behaviors according to the time series of perceptual
information, whereas the original subsumption architecture selects one
behavior. This multi-objective behavior coordination is composed of a sensory
network, a behavior coordinator, and a behavior weight updater. The sensory
network extracts perceptual information based on sensing data and updates the
parameters of the sensors recursively according to the perceptual information.
A behavior weight is assigned to each behavior. Based on Eq. (3), the output
is calculated by

$$ y_k = \frac{\sum_{j=1}^{K} wgt_j(t) \cdot y_{j,k}}{\sum_{j=1}^{K} wgt_j(t)} \qquad (4) $$

where K is the number of behaviors; $wgt_j(t)$ is the behavior weight of the
j-th behavior at the discrete time step t; and $y_{j,k}$ is the k-th output of
the j-th behavior obtained by Eq. (3). By updating the behavior weights, the
robot can take a multi-objective behavior according to the time series of
perceptual information. The update amount of each behavior weight is
calculated as follows.


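$$ \begin{bmatrix} \Delta wgt_1 \\ \Delta wgt_2 \\ \vdots \\ \Delta wgt_K \end{bmatrix} = \begin{bmatrix} dw_{1,1} & dw_{1,2} & \cdots & dw_{1,L} \\ dw_{2,1} & dw_{2,2} & \cdots & dw_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ dw_{K,1} & dw_{K,2} & \cdots & dw_{K,L} \end{bmatrix} \begin{bmatrix} si_1 \\ si_2 \\ \vdots \\ si_L \end{bmatrix} \qquad (5) $$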

where $si_i$ is the parameter on the perceptual information; L is the number of
perceptual inputs. This method can be considered as a mixture of experts if
the behavior coordinator is regarded as a gating network.
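The behavior coordination of Eqs. (4) and (5) amounts to a normalized weighted blend of behavior outputs plus a linear weight update. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the update matrix is stored as K rows of L coefficients, and no clipping of the weights is applied because the text does not describe any.

```python
def coordinate(behavior_outputs, weights):
    """Blend K behavior outputs (each a list of N motor levels) by
    normalized behavior weights."""
    total = sum(weights)
    n_out = len(behavior_outputs[0])
    return [
        sum(w * y[k] for w, y in zip(weights, behavior_outputs)) / total
        for k in range(n_out)
    ]


def update_weights(weights, dw, si):
    """Add the matrix-vector product of dw (K x L) and si (length L)
    to the behavior weights."""
    return [
        w + sum(d * s for d, s in zip(row, si))
        for w, row in zip(weights, dw)
    ]
```

For example, with collision avoidance weighted 3 and target tracing weighted 1, the motor command is pulled three quarters of the way toward the collision avoidance output.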

3  Simultaneous  Localization  and  Mapping 

3.1  Growing  Topological  Map  Building 

Map building can be regarded as an unsupervised learning problem in which the
sampled data are noisy and imprecise, because measurement noise is included in
the measured data (Fig. 4). A self-organizing map (SOM) is often applied to
extract relationships among measured data, since SOM can learn the hidden
topological structure from the data [22]. The original SOM uses a pre-defined
number of nodes. Neural gas has also been used for constructing a topological
map, and furthermore, growing neural gas is used for incremental learning of
the topological structure [23-27]. Local error measures are used for
determining where to insert new nodes. The competitive Hebbian rule generates
the edges between nodes.

The addition of nodes and the generation of the edges between nodes can be
applied to topological map building (Fig. 5). Therefore, we proposed a
topological map building method based on the concept of growing neural gas. We
explain the method in the following. At the first measurement, the measurement
points are added as the initial nodes of the topological map. Afterward, the
topological map is updated according to the measured data.


Fig. 4. Topological map building


(a) Node addition to the map

(b-1) Node deletion based on density (b-2) Node deletion based on phase
Fig. 5. Growing topological map building

When the i-th reference vector of the topological map is represented by $r_i$,
the Euclidean distance between an input vector $v$ and the i-th reference
vector is defined as

$$ d_i = \| v - r_i \| \qquad (6) $$

where $r_i = (r_{1,i}, r_{2,i}, \ldots, r_{N,i})$. Next, the k-th output node
minimizing the distance $d_i$ is selected by

$$ k = \arg\min_i \{ \| v - r_i \| \} \qquad (7) $$

The selected output node is the nearest point on the topological environmental
map according to the measured distance. Furthermore, the reference vector of
the i-th output node is trained by

$$ r_i \leftarrow r_i + \xi \cdot \zeta_{k,i} \cdot (v - r_i) \qquad (8) $$

where $\xi$ is a learning rate ($0 < \xi < 1.0$) and $\zeta_{k,i}$ is a
neighborhood function ($0 < \zeta_{k,i} < 1.0$).

The number of nodes, $n_{node}$, is gradually increased when there is no node
corresponding to the input data. The number of inputs in each sampling of the
distance information is L (L = 682). The procedure of the topological map
building is as follows:

Step 1: Initialization of the map based on the first measurement; t = 1.

Step 2: Distance measurement (z(t)) and motion output (y(t))

Step 3: for i = 1 to L do

Step 4:   Select the k-th node according to the distance

Step 5:   if $d_k > D_{max}$ then $n_{node}$++; add $r_{n_{node}}$
          otherwise, update $r_k$

Step 6: end for

Step 7: Generate a set $O_k$ composed of the nodes near the k-th node.

Step 8: If the number of nodes in $O_k$ is larger than the predefined number
        $n_{max}$, the least selected node is removed from the topological map.

Step 9: Remove unnecessary nodes

Step 10: t++

Step 11: go to Step 2

This method is composed of three steps: node addition (Step 5), learning
(Step 5), and node deletion (Steps 7-9). Node deletion is performed in order
to remove unnecessary and crowded nodes.

Figure 5 shows an example of topological map building. If no node exists at
the position corresponding to the measured distance, a node is added to the
map (Fig. 5 (a)). If many nodes are crowded together, some of them are removed
from the map (Fig. 5 (b)).
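The node addition and learning parts of the procedure (Steps 3-6) can be sketched as follows. This is a simplified illustration under explicit assumptions: only the winning node is moved (the neighborhood function is treated as 1 for the winner and 0 elsewhere), node deletion (Steps 7-9) is omitted, and the names `update_map`, `d_max`, and `rate` are hypothetical.

```python
import math


def update_map(nodes, points, d_max, rate):
    """Process one scan of measured points against the topological map.

    A point farther than d_max from every node becomes a new node;
    otherwise the nearest node is moved toward the point.
    """
    for v in points:
        if not nodes:
            nodes.append(list(v))  # first measurement initializes the map
            continue
        # Index of the nearest node (the winner k).
        k = min(range(len(nodes)), key=lambda i: math.dist(v, nodes[i]))
        if math.dist(v, nodes[k]) > d_max:
            nodes.append(list(v))  # node addition
        else:
            # Learning, winner only: r_k <- r_k + rate * (v - r_k)
            nodes[k] = [r + rate * (x - r) for r, x in zip(nodes[k], v)]
    return nodes
```

Because nodes are added only when no existing node lies within `d_max`, this threshold effectively sets the spatial resolution of the map.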

3.2 Steady-State Genetic Algorithm for Localization

As one stream of evolutionary computing, genetic algorithms (GAs) have been
effectively used for solving optimization problems in robotics [28-32]. GAs
can produce a feasible solution, not necessarily an optimal one, at a low
computational cost. SSGA simulates a continuous model of generations, in which
only a few individuals are eliminated and generated in each generation
(iteration). A candidate solution, called an individual, is composed of
numerical parameters representing the correction to the current position
($g_{j,1}$, $g_{j,2}$) and rotation ($g_{j,3}$). In SSGA, only a few existing
solutions are replaced by new candidate solutions generated by genetic
operators in each generation [32]. In this paper, the worst candidate solution
is eliminated and replaced with a candidate solution generated by crossover
and mutation. We use elitist crossover and adaptive mutation. Elitist
crossover randomly selects one individual and generates a new individual by
combining genetic information from the selected individual and the best
individual in order to obtain feasible solutions rapidly. Next, the following
adaptive mutation is applied to the generated individual.


(9) 


where $f_i$ is the fitness value of the i-th individual; $f_{max}$ and
$f_{min}$ are the maximum and minimum fitness values in the population;
N(0, 1) indicates a normal random value; $\alpha_j$ and $\beta_j$ are the
coefficient and offset, respectively. In the adaptive mutation, the variance
of the normal random number is changed relative to the fitness values of the
population. The fitness value is calculated by the following equation.


(10)


where  is  the  distance  between  the  measured  point  and  its  nearest  node  in  the  topo¬ 
logical  map;  f  ,  Az ,  and  Ay  are  weight  parameters  for  multi-objective  optimization. 

These weight parameters are determined heuristically. The localization thus
becomes a minimization problem. The population size is G, and the number of
iterations is T. The procedure of SSGA for the localization is as follows:

Step 1: Initialization of samples and importance factors; t = 1.

Step 2: Distance measurement (z(t)) and motion output (y(t))

Step 3: for i = 1 to G do

Step 4:   Adaptive mutation

Step 5:   Evaluation

Step 6: end for

Step 7: for i = 1 to T do

Step 8:   Least-fitness selection

Step 9:   Elitist crossover

Step 10:  Adaptive mutation

Step 11: end for

Step 12: Update the self-position according to the best individual

Step 13: t++

Step 14: go to Step 2

In  the  adaptive  mutation  in  Step  4,  some  individuals  partially  inherit  genetic 
information  from  the  previous  population. 
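The inner loop of the procedure (Steps 7-11) can be illustrated with the following sketch. It is not the authors' implementation. Several parts are assumptions made only so that the example runs: the fitness is taken as the plain sum of distances from each corrected scan point to its nearest map node (the weight parameters of the fitness function are omitted), the mutation scale grows linearly with the fitness of the randomly selected parent, and the names and default values are hypothetical.

```python
import math
import random


def transform(points, pose):
    """Apply a candidate correction (dx, dy, dtheta) to 2-D scan points."""
    dx, dy, th = pose
    c, s = math.cos(th), math.sin(th)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]


def fitness(pose, scan, nodes):
    """Assumed fitness: sum of distances from each corrected scan point
    to its nearest map node (lower is better)."""
    return sum(min(math.dist(p, n) for n in nodes) for p in transform(scan, pose))


def ssga_localize(scan, nodes, pop_size=30, iterations=300,
                  alpha=0.2, beta=0.01, rng=None):
    """Steady-state GA: each iteration replaces only the worst individual."""
    if rng is None:
        rng = random.Random()
    pop = [[rng.gauss(0.0, 0.5) for _ in range(3)] for _ in range(pop_size)]
    fit = [fitness(g, scan, nodes) for g in pop]
    for _ in range(iterations):
        worst = max(range(pop_size), key=fit.__getitem__)  # least-fitness selection
        best = min(range(pop_size), key=fit.__getitem__)
        mate = rng.randrange(pop_size)
        # Elitist crossover: each gene comes from the best or the random parent.
        child = [rng.choice((a, b)) for a, b in zip(pop[best], pop[mate])]
        # Adaptive mutation (assumed form): worse parents mutate more strongly.
        span = (fit[worst] - fit[best]) or 1.0
        scale = alpha * (fit[mate] - fit[best]) / span + beta
        child = [g + scale * rng.gauss(0.0, 1.0) for g in child]
        pop[worst], fit[worst] = child, fitness(child, scan, nodes)
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

Because only the worst individual is overwritten, the best fitness in the population never gets worse from one iteration to the next, which is the property that makes the steady-state scheme attractive for online localization.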

4  Experimental  Result 

This section shows experimental results of the proposed method. Figure 7 shows
the environment of an elevator hall, which measures approximately 11 x 15 [m]
and contains many obstacles. The robot starts at the initial position in the
lower right of Fig. 7 (a), moves along the red line, and returns to the
initial position. The maximal number of nodes is 1,000. The population size of
SSGA is 50, and the number of evaluations of SSGA is 500, including the
evaluation of individuals in the initialization. Because the robot does not
have rotary encoders for dead reckoning, it must perform localization
according to the distance information measured by the LRF. Figure 8 shows an
experimental result of the proposed method. The features of the environment
are extracted by the proposed method, and the topological map is generated. In
fact, the number of obstacles is 3 in the center of the obtained topological
map. The final number of nodes is 361.


(a) Map of elevator hall (b) View of elevator hall

Fig. 7. An experimental environment (Case 1)


Fig. 8. An experimental result of SLAM (Case 1)


(a) View of room (b) Structure of the room's objects (c) Map of SLAM

Fig. 9. Partner robot's SLAM in the room (Case 2)

Figure 9 shows an example of a living room for elderly people, which measures
approximately 5 x 2.5 [m] and contains many obstacles (Case 2). Figure 11
shows an experimental result of the proposed method. The features of the
environment are extracted by the proposed method, and the topological map is
generated. Because the height of the table is nearly equal to the height of
the LRF mounted on the robot in this experiment, the built map partially
includes the top board of the table. Table 1 shows the number of nodes used in
the map, the size of the map, and the errors of the estimated position and
posture in the final state of SLAM. The obtained result is sufficient for the
robot to interact with people.


Table 1. Status of the map obtained by SLAM in Case 2

Item                                     Value
Number of nodes                          591
Measured distance (size of the room)     3977 x 2457 [mm]
Error of the estimated robot position    X: -45 [mm], Y: 0.2 [mm]
Error of the estimated posture           2°

5  Summary 

This paper discussed the SLAM of a mobile robot based on computational
intelligence. We proposed a topological map building method for SLAM that uses
a growing neural network, together with a steady-state genetic algorithm for
the localization. In the experiments in the elevator hall of the university
and in a living room for elderly people, the map building was performed
successfully by the proposed method, although the robot is not equipped with
rotary encoders for dead reckoning and the moving direction of the mobile
robot is not used in the proposed method. However, it is difficult to conduct
map building using small objects such as the legs of tables and chairs in the
living room, because such a leg is observed as a point, not as a line or
plane. Furthermore, SLAM might memorize a moving object as a static object. As
a result, such a moving object can become noise in the map used for SLAM.

As future work, we should consider the object properties obtained from the
informationally structured space. We intend to perform experiments in a
corridor on a large floor in order to show the effectiveness of the proposed
method. Furthermore, we will develop a topological map building method based
on temporal reliability in a dynamic environment.


Abstract. Model selection for machine learning systems is one of the most
important issues to be addressed for obtaining greater generalization
capabilities. This paper proposes a strategy to achieve model selection
incrementally under virtual concept drifting environments, where the
distribution of learning samples varies over time. To carry out incremental
model selection, the system generally uses all the learning samples that have
been observed until now. Under virtual concept drifting environments, however,
the distribution of the observed samples is considerably different from that
under real concept drifting environments, so that model selection is usually
unsuccessful. To overcome this problem, the author had earlier proposed a
weighted objective function and a model-selection criterion based on the
predictive input density of the learning samples. Although the method
described in the author's previous study performs well on some datasets, it
occasionally fails to yield appropriate learning results because of a failure
in the prediction of the actual input density. To overcome this drawback, the
method proposed in this paper improves on the previously described method to
yield the desired outputs using an ensemble of the constructed radial basis
function neural networks (RBFNNs). Experimental results indicate that the
improved method yields a stable performance.


1  Introduction 

Let the learning samples be $(x_b, y_b)$ (b = 1, 2, ...), whose joint
probability distribution is P(x, y) = P(y|x)P(x). To achieve successful
learning of the relation between x and y, P(y|x), using a model-based learning
machine, the system generally uses all the observed samples. Although the
empirical input density P(x) approximates the actual input density as the
number of learning samples increases, it generally differs widely from the
actual density in the early steps of the learning. Moreover, the center of
P(x) is usually nonstationary. Such changing environments are usually called
"virtual concept drifting environments." Because of the underlying principle
of such environments, the learning process cannot yield the best model.

To overcome this problem, the author had earlier proposed a model-selection
criterion based on the predictive distribution of the learning samples [1, 2].
This method is an extended version of the learning strategies under covariate
shift (e.g., [3], [4]). Under covariate shift, the learning input density P(x)
is not equivalent to the density of the test samples. In such environments,
learning machines need to adjust their parameters to minimize the following
weighted error function so as to acquire greater generalization capabilities:

$$ E = \sum_{b=1}^{N} W(x_b) \left( F(x_b) - f_\theta(x_b) \right)^2 \qquad (1) $$

where W(x) is the weight used for each sample, $W(x) = (q(x)/P(x))^\lambda$,
where q(x) denotes the density of x for the test samples and
$0 < \lambda < 1$ denotes the flattening parameter. Here, $f_\theta(x)$
denotes the output of the learning machine and F(x) denotes the target output.
In incremental learning, q(x) corresponds to the input density for all the
learning samples; this includes not only the new samples introduced in
subsequent phases but also the learning samples of the earlier learning
phases.
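The role of the flattening parameter in the weighted error function can be seen in a small sketch. It assumes the density values q(x) and P(x) at each sample are already available as plain numbers; the function names are hypothetical and the example is not the author's code.

```python
def importance_weights(q_vals, p_vals, lam):
    """W(x) = (q(x) / P(x)) ** lam for each sample.

    lam = 0 gives uniform weights (ordinary least squares);
    lam = 1 gives the full density ratio.
    """
    return [(q / p) ** lam for q, p in zip(q_vals, p_vals)]


def weighted_squared_error(targets, outputs, weights):
    """Sum of squared errors, each scaled by its sample weight."""
    return sum(w * (t - o) ** 2 for t, o, w in zip(targets, outputs, weights))
```

Intermediate values of the flattening parameter trade off the bias of the unweighted error against the variance introduced by large density ratios.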

Although the method proposed in the previous study performs well on some
datasets, it occasionally fails to yield appropriate learning results because
of a failure in the prediction of the actual input density. To overcome this
drawback, the method proposed in this paper improves on the previously
described method to yield better learning results using an ensemble of several
learning results.

The next section describes the incremental learning scheme assumed in this
study. Section 3 presents a model of virtual concept drifting environments.
Sections 4 and 5 describe the incremental learning and model-selection methods
used in this study. Section 6 explains the calculation of the predicted output
of the system. Section 7 presents the results for synthetic and benchmark test
datasets, and Section 8 provides the conclusion.

2  Learning  Scheme 

Let us consider the simplified incremental learning scheme shown in Fig. 1; it
has a fundamental incremental learning architecture with a re-learning
(rehearsal) process similar to that proposed previously [5, 6].

This system alternates between two phases, i.e., recording and rehearsal.
During the recording phase, the learning system obtains a new chunk of several
new learning samples and stores these samples in a buffer having a small
capacity. After the recording phase, the rehearsal phase begins. During the
rehearsal phase, all the samples in the buffer, together with the samples
generated by the previous neural network (the pseudo-old samples), are
presented to the current neural network. The previous neural network is the
one built in the previous learning phase. The initial parameters of the
current neural network are obtained from the previous neural network. Note
that the current neural network rehearses not only the new learning samples
but also the pseudo-old samples to prevent "catastrophic forgetting."


Fig. 1. Incremental learning scheme

The neural network used was a radial basis function neural network (RBFNN).
Let $f_\theta(x)$ be the output value of the RBFNN. $f_\theta(x)$ is given by


$$ f_\theta(x) = \sum_{j=1}^{M} w_j \exp\left( -\frac{\| x - u_j \|^2}{2\sigma_j^2} \right) \qquad (2) $$


where M denotes the number of hidden units, and $u_j$ and $\sigma_j^2$ denote
the center and variance of the j-th hidden unit, respectively. The aim of the
learning system is to minimize the following evaluation function:

$$ E = \int \left( F(x) - f_\theta(x) \right)^2 q(x) \, dx \qquad (3) $$

where F(x) denotes the target output and q(x) the actual input density
required to obtain the ideal learning result. If the actual input density
varies, q(x) should be averaged over time. Note that q(x) is not equivalent to
the empirical input density P(x).
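The RBFNN output of Eq. (2) is a weighted sum of Gaussian bumps. The sketch below is an illustration only: it assumes one scalar variance per hidden unit and a factor of 2 in the denominator of the exponent, and the function name `rbf_output` is hypothetical.

```python
import math


def rbf_output(x, centers, variances, weights):
    """Weighted sum of Gaussian basis functions evaluated at input x."""
    out = 0.0
    for u, var, w in zip(centers, variances, weights):
        # Squared Euclidean distance between the input and the unit center.
        sq = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
        out += w * math.exp(-sq / (2.0 * var))
    return out
```

Each hidden unit responds only near its center, so the network output decays toward zero for inputs far from every center.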


3  Modeling  of  Virtual  Concept  Drifting  Environments 

In order to devise a learning method to minimize the weighted error function
given by Eq. (1), we need to derive the (average) actual input density q(x)
and the empirical input density P(x). It is essential to predict q(x)
beforehand using the given learning samples.

3.1  Prediction  of  q(x) 

The following predicted distribution of x, obtained from the N learning
samples presented so far, is used in this study. Let $\hat{q}(x)$ be the
predicted q(x).

$$ \hat{q}(x) = \int P(x|S) \, P(S | x_1, x_2, \ldots, x_N) \, dS, \qquad (4) $$

where S denotes the parameter vector that represents the input density
function. In the simplest case, q(x) would be a Gaussian probability
distribution; in such cases, according to Bayes' theorem, $\hat{q}(x)$ should
be approximated by a Student's t-distribution with (N - 1) degrees of freedom.
Therefore,


$$ \hat{q}(x) = \frac{\Gamma[(N-1+p)/2]}{((N-1)\pi)^{p/2} \, \Gamma[(N-1)/2] \, |\Sigma|^{1/2}} \left[ 1 + \frac{(x-u)^T \Sigma^{-1} (x-u)}{N-1} \right]^{-(N-1+p)/2} \qquad (5) $$


where p is the number of input dimensions, $u = E[x]$, $\Gamma[\cdot]$ denotes
the gamma function, and $\Sigma$ denotes the scale matrix. The scale matrix is
given by $\Sigma = ((n-3)/n) C$, $C = E[(x-u)(x-u)^T]$ [7]. Note that the
Student's t-distribution converges to the actual Gaussian distribution q(x) as
the number of presented learning samples increases.
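The convergence of the Student's t-distribution to a Gaussian can be checked numerically. The sketch below covers only the one-dimensional case (p = 1) of Eq. (5), with the degrees of freedom passed in directly; the function names are hypothetical and the log-gamma formulation is a choice made here to avoid overflow.

```python
import math


def student_t_pdf(x, mean, scale2, dof):
    """Univariate Student's t density with location `mean`,
    squared scale `scale2`, and `dof` degrees of freedom."""
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi * scale2))
    z2 = (x - mean) ** 2 / scale2
    return math.exp(log_c - (dof + 1) / 2 * math.log1p(z2 / dof))


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
```

For small degrees of freedom the t density has much heavier tails than the Gaussian, which is what allows the predicted density to cover regions not yet visited by the samples.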

In many cases, however, the center of the actual input density moves over
time, so the averaged density is difficult to approximate by using a single
Student's t-distribution. To overcome this difficulty, we extended Eq. (5) to
support more complex input distributions.

Let us imagine the sensory inputs of a robot. In such an actual environment,
each sample is highly related to the current state. Therefore, the learner
observes many similar samples within a short interval of time during which the
state of the robot remains almost the same. If the robot moves to another
location, however, its state changes and the input distribution also changes.
Similar situations frequently appear in actual incremental-learning
environments.

Let us denote the current state by $S_i$ (i = 1, 2, ...). Each state, $S_i$,
is represented by the corresponding position in the input space. We assume
that the state will change during certain periods but will return to the same
state after a prolonged period (e.g., Fig. 2).

Therefore, this is an ergodic Markov process, which means that the probability
of each $S_i$ converges to a certain value $p(S_i)$ that does not depend on
the initial state.¹ From this, q(x) can be approximated as

$$ q(x) \simeq \sum_i q(x|S_i) \, p(S_i). \qquad (6) $$

Therefore, q(x) can be represented as a mixture of distributions. Similarly,
P(x) is given by $P(x) \simeq \sum_i P(x|S_i) \, p(S_i)$. Therefore,

$$ \frac{q(x)}{P(x)} \simeq \frac{\sum_i q(x|S_i) \, p(S_i)}{\sum_i P(x|S_i) \, p(S_i)}. \qquad (7) $$


Fig. 2. State transition between input distributions. Each state, $S_i$, is
represented as the corresponding position of input space. $p_{ij}$ denotes the
state transition between the i-th and j-th states, and $q(x|S_i)$ is the
Gaussian distribution.


¹ We assume that the time interval for state transition is considerably longer
than that for presenting each sample.
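The statement that the state probabilities of an ergodic Markov process do not depend on the initial state can be checked with a short numerical sketch. The two-state transition matrix used in the usage note is an arbitrary illustrative choice, and the function name is hypothetical.

```python
def stationary_distribution(transition, start, steps=200):
    """Propagate a state distribution through a row-stochastic
    transition matrix for a fixed number of steps."""
    dist = list(start)
    n = len(dist)
    for _ in range(steps):
        # New probability of state j: sum over predecessors i.
        dist = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return dist
```

With the transition matrix [[0.9, 0.1], [0.2, 0.8]], starting from either state leads to the same limiting probabilities of 2/3 and 1/3, which play the role of the mixing coefficients p(S_i) in Eq. (6).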

where $P(x|S_i)$ and $q(x|S_i)$ represent the Gaussian and Student's
t-distributions, respectively; the center of $q(x|S_i)$ coincides with that of
$P(x|S_i)$. The calculation of Eq. (7) requires the precise addition of the
coefficients of the Student's t-distributions; this can be approximated
without utilizing the coefficients under the assumption that the effect of the
tails of the distributions is low:

$$ \frac{q(x)}{P(x)} \simeq \frac{q(x|S_j) \, p(S_j)}{P(x|S_j) \, p(S_j)} = \frac{q(x|S_j)}{P(x|S_j)}, \quad \text{where } j = \arg\max_i P(x|S_i). \qquad (8) $$

$P(x|S_i)$ is also a Gaussian distribution, with center $u_i$ and
variance-covariance matrix $\Sigma_i$. These are determined by using an
incremental Expectation-Maximization (EM) algorithm, which is an improved
version of that in Ref. [8]. The incremental EM algorithm is not the same as
the online EM algorithm, because the method needs to work even if the inputs
are not i.i.d. samples. The detailed algorithm is explained in the next
section.

In other words, a Gaussian mixture distribution is constructed with the EM
algorithm, but only the resulting $u_i$ and $\Sigma_i$ are used in the
following method. In this case, the appropriate number of Gaussians should
also be determined by using an information criterion such as AIC [9].
Therefore, the Gaussian mixture distribution having the smallest AIC value is
taken as the appropriate data model.

Then,  the  estimate  is  applied  to  all  Gaussian  distributions  in  the  resulting 
mixture  model.  Therefore,  if  the  likelihood  of  x  for  the  i-th  Gaussian  is  the 
maximum  of  all  likelihoods,  the  corresponding  W(x)  is 


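\[ W(x) = \left\{ \left(\frac{2}{N_i-1}\right)^{p/2} \frac{\Gamma[(N_i+p-1)/2]}{\Gamma[(N_i-1)/2]} \cdot \frac{\left[1 + \frac{1}{N_i-1}(x-u_i)^T S_i^{-1}(x-u_i)\right]^{-(N_i+p-1)/2}}{\exp\left(-\frac{1}{2}(x-u_i)^T S_i^{-1}(x-u_i)\right)} \right\}^{\lambda}, \tag{9} \]

\[ \text{where } i = \arg\max_j \frac{1}{(2\pi)^{p/2}|C_j|^{1/2}} \exp\left(-\frac{1}{2}(x-u_j)^T C_j^{-1}(x-u_j)\right). \tag{10} \]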

In the above two equations, N_i, S_i, and u_i correspond to the degree of freedom, the scale matrix, and the mean input vector of the i-th Gaussian distribution. The scale matrix is described by S_i = ((n - 3)/n) C_j, C_j = E[(x - u_j)(x - u_j)^T]

[7].  The  degree  of  freedom  was  set  to  Nx  ~  |P(xn|Sj)/ P(xn|5j)|  for 
simplicity. 
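As an illustration of Eq.(9), the following is a minimal one-dimensional sketch (p = 1) of the importance weight, i.e. the ratio of a Student's-t density to a Gaussian density with the same center and scale, raised to the flattening parameter λ. The function name `importance_weight`, its argument names, and the optional cap `w_max` are assumptions made for this sketch and are not taken from the original method description.

```python
import math


def importance_weight(x, u, s2, n_i, lam=1.0, w_max=None):
    """One-dimensional sketch of Eq.(9): (Student's-t / Gaussian) ** lam.

    x     : input value
    u     : center of the selected Gaussian
    s2    : scalar scale (variance-like) parameter, must be > 0
    n_i   : degree-of-freedom parameter N_i, must be > 1
    lam   : flattening parameter lambda (lam = 0 gives weight 1)
    w_max : optional upper bound on the weight (hypothetical helper argument)
    """
    nu = n_i - 1.0          # degrees of freedom appearing in Eq.(9)
    p = 1                   # input dimension of this sketch
    z = (x - u) ** 2 / s2   # squared distance scaled by s2
    # log of (2/nu)^(p/2) * Gamma((nu+p)/2) / Gamma(nu/2)
    log_coef = (0.5 * p * math.log(2.0 / nu)
                + math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0))
    log_t = -0.5 * (nu + p) * math.log1p(z / nu)   # Student's-t kernel
    log_gauss = -0.5 * z                           # Gaussian kernel
    w = math.exp(lam * (log_coef + log_t - log_gauss))
    return w if w_max is None else min(w, w_max)
```

Working in the log domain avoids overflow of the Gamma function for large N_i. With this form, inputs far from the center receive weights above 1 because the Student's-t tails are heavier than the Gaussian tails.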

Note that this method approximates q(x) using given samples. Consequently, if the number of given samples is too small, it is hard to accurately approximate q(x). The method only approximates q(x) where x is near to one of the learned samples. Although it has the above drawback, the method is adequate for learning RBFNN. This is because the RBFNN eventually yields proper outputs only for inputs x near to the learned samples.

4 Incremental EM-Algorithm

To predict q(x), the system needs to execute the EM algorithm to construct a Gaussian mixture model for P(x). The original EM algorithm, however, needs to store all the learning samples in advance. To save storage space and computational power, an online version of the EM algorithm would be suitable for this system. Unfortunately, current online EM algorithms (e.g., [10]) are designed to forget past learning results in order to adjust to the current input distribution. This property is not suitable for handling virtual-concept-drifting environments.

In  the  virtual-concept  drifting  environments,  the  model  parameters  should 
be  adjusted  not  only  to  the  new  learning  samples  but  also  to  the  old  learning 
samples.  Therefore,  the  system  needs  the  old  learning  samples  as  well  as  the 
new  ones  to  reconfigure  the  model  parameters.  To  overcome  the  problem,  the 
system  uses  pseudo-samples  generated  by  the  RBFNN,  which  was  constructed 
in  the  previous  rehearsal  phase,  instead  of  using  the  real  old  learning  samples. 
The  pseudo-sample  generation  algorithm  is  to  be  explained  in  section  5.2. 

The number of Gaussians should also be optimal to obtain greater generalization capability. To find such a data model quickly, the EM algorithm and the AIC estimation are applied while adding Gaussians one by one to the preceding best model until the current AIC value becomes larger than that of the previous one. Then, the previous model is adopted as the final Gaussian mixture model.
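The stopping rule described above can be sketched as a simple loop. This is only an outline under stated assumptions: `fit_mixture` is a hypothetical callback (not part of the original system) that runs EM for a given number of Gaussians and returns the log-likelihood and the number of free parameters; `max_k` is an assumed safety bound.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: smaller is better."""
    return 2.0 * num_params - 2.0 * log_likelihood


def select_num_gaussians(fit_mixture, start_k=1, max_k=20):
    """Add Gaussians one by one until the AIC value increases.

    fit_mixture(k) is a hypothetical callback returning
    (log_likelihood, num_params) for a mixture with k components.
    Returns the number of components and AIC of the last model
    before the AIC became larger.
    """
    best_k = start_k
    best_aic = aic(*fit_mixture(best_k))
    for k in range(start_k + 1, max_k + 1):
        current = aic(*fit_mixture(k))
        if current > best_aic:
            break  # the previous model is kept as the final one
        best_k, best_aic = k, current
    return best_k, best_aic
```

Starting the search at the preceding best model (`start_k`) rather than at one component mirrors the idea of growing the model incrementally instead of refitting from scratch.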

5  Incremental  Learning  and  Model  Selection  for  RBFNN 

The RBFNN learns samples stored in the buffer and pseudo-samples generated from the previous RBFNN in each rehearsal phase. As discussed in Section 1, the RBFNN has to minimize the weighted error function, i.e., Eq.(1).

In this study, a modified version of the quick RBFNN learning method proposed by Moody and Darken (1989) [11] was used because it ensured that the output connection strengths were always optimal values, which minimized the error function, under a corresponding setting for hidden units. Moreover, it could also support various numbers of hidden units, which were fewer than the learning samples. The appropriate number of hidden units was selected using an information criterion, ICW, described in Section 5.4.

5.1  Learning  of  First  Chunk 

In the modified RBFNN method, the centers and variances of the RBF hidden units are determined using a weighted fuzzy k-means algorithm, whereas the connection weights between the RBF hidden units and the output unit are determined by a weighted least squares (WLS) method.

The weighted fuzzy k-means algorithm, which is an extended version of the fuzzy k-means algorithm [12], updates the cluster center u_j not only according to the cluster centers obtained in the previous step but also according to the weight of each sample, as given in the equation below. Note that the cluster center is the center of each hidden unit of the RBFNN.

\[ u_j^{(n+1)} := \sum_{b} \frac{W(x_b)\, x_b \exp\left(-\|x_b - u_j^{(n)}\|^2/(2\sigma^2)\right)}{c_w \sum_{j'} \exp\left(-\|x_b - u_{j'}^{(n)}\|^2/(2\sigma^2)\right)}, \tag{11} \]

where c_w = Σ_b W(x_b) and σ is the standard deviation. For simplicity, the initial centers, u_j^(0), are set to the first k samples, i.e., x_j, in the buffer B. After the weighted fuzzy k-means algorithm converges, the variance of each hidden unit is set to

\[ \sigma_j^2 = \kappa \min_{j' \neq j} \|u_j - u_{j'}\|^2, \tag{12} \]

where κ (> 0) denotes the overlapping factor [13].

The WLS derives the optimized output connection vector w_ML = (w_1, w_2, ..., w_m)^T, where w_i denotes the connection weight between the i-th hidden unit and the output unit, that analytically minimizes Eq.(1). Therefore,

\[ w_{ML} = (\Phi_0^T W_0 \Phi_0)^{-1} \Phi_0^T W_0 F_0, \tag{13} \]

where F_0 denotes the target output vector of the first chunk (F_0 = (F(x_1), F(x_2), ..., F(x_{N_0}))^T) and W_0 is a diagonal matrix whose diagonal elements are given by W_{0,bb} = W(x_b) (b = 1, 2, ..., N_0). Φ_0 is the design matrix of the first chunk, whose elements are given by φ_{0,bj} = exp(-||x_b - u_j||^2/(2σ^2)). Using the modified RBFNN method, the learning system ensures that the output connections will always have optimal weights, so that we can accurately estimate the effect of the weighted error function given by Eq.(1).
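The closed-form solution of Eq.(13) can be illustrated with a small pure-Python solver for the normal equations (Φ^T W Φ) w = Φ^T W F. This is a sketch for small problems only; the function name and the use of Gaussian elimination with partial pivoting are choices made here, not details given in the original text.

```python
def weighted_least_squares(phi, sample_weights, targets):
    """Solve (Phi^T W Phi) w = Phi^T W F for the output weights w.

    phi            : design matrix as a list of rows (one row per sample)
    sample_weights : diagonal of W, one importance weight per sample
    targets        : target outputs F, one per sample
    """
    m = len(phi[0])
    # Accumulate A = Phi^T W Phi and b = Phi^T W F sample by sample.
    a = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for row, w_b, f_b in zip(phi, sample_weights, targets):
        for i in range(m):
            b[i] += row[i] * w_b * f_b
            for j in range(m):
                a[i][j] += row[i] * w_b * row[j]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("Phi^T W Phi is singular")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * m
    for i in range(m - 1, -1, -1):
        s = sum(a[i][j] * w[j] for j in range(i + 1, m))
        w[i] = (b[i] - s) / a[i][i]
    return w
```

For example, with a single constant basis function and two samples with targets 0 and 10 and weights 1 and 3, the solution is the weighted mean 7.5, which shows how heavily weighted samples pull the fit toward themselves.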

5.2  Pseudo  Sample  Generation 

The WLS method essentially requires all the samples to construct the design matrix used in Eq.(13). However, recording all the samples consumes a huge amount of storage space in the later steps of learning. To overcome this problem, we should regenerate them as pseudo-samples. One method of generating a pseudo-sample is to use the center of a hidden unit and the corresponding output of the RBFNN, as proposed in [5, 14]. However, two problems are encountered in applying such methods to this learning system:

1. This model reduces the number of hidden units through model selection; therefore, the number of pseudo-samples will also be reduced. A small number of pseudo-samples usually yields poor learning results.

2. The pseudo-sample distribution generated by the former models is not equivalent to the original sample distribution. This also degrades the system performance.

To overcome these problems, the system stores the RBFNN parameters that were determined in the previous rehearsal phase and the key information of the learned samples. The key information for the p-th learning sample is (j_p, F_p), where j_p = argmax_j φ_j(x_p) and F_p = f_{θ_{n-1}}(x_p), where f_{θ_{n-1}}(x_p) denotes the previous RBFNN output for x_p. Note that the system only needs to save a single two-dimensional datum (j_p, F_p) for each learning sample. Therefore, less storage space is used for saving this information than for saving the real learning samples if the dimension of x_p is larger than two. Using the RBFNN parameters of the previous rehearsal phase, the system can approximately regenerate the p-th sample by minimizing the difference between the recorded output and the RBFNN output as follows:

\[ x_p := x_p - \eta \frac{\partial}{\partial x_p} \left\{ F_p - f_{\theta_{n-1}}(x_p) \right\}^2, \tag{14} \]

where η denotes the varying speed of x_p. This is known as the gradient descent method, where the convergence speed depends on the value of η. The initial vector of x_p is set to u_{j_p} + ε, where j_p is the key information for the p-th learning sample and ε is a small random vector. The update is repeated until convergence. Thereafter, x_p can be used as the p-th pseudo-learning sample.
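The regeneration step can be sketched as follows for a Gaussian RBF network with one output. The helper names, the per-unit widths `sigmas`, the default step size, the iteration limit, and the tolerance are all assumptions of this sketch; only the idea (start near the stored hidden-unit center and descend the squared output error) comes from the description above.

```python
import math
import random


def rbf_output(x, centers, sigmas, weights):
    """Output of a Gaussian RBF network for one input vector x."""
    total = 0.0
    for u, s, w in zip(centers, sigmas, weights):
        d2 = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
        total += w * math.exp(-d2 / (2.0 * s * s))
    return total


def rbf_gradient(x, centers, sigmas, weights):
    """Gradient of the network output with respect to x."""
    grad = [0.0] * len(x)
    for u, s, w in zip(centers, sigmas, weights):
        d2 = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
        phi = math.exp(-d2 / (2.0 * s * s))
        for k in range(len(x)):
            grad[k] -= w * phi * (x[k] - u[k]) / (s * s)
    return grad


def regenerate_sample(j_p, f_p, centers, sigmas, weights,
                      eta=0.5, steps=2000, tol=1e-12, noise=None):
    """Recover a pseudo-sample from the key information (j_p, f_p).

    Starts at the center of hidden unit j_p plus a small offset and
    applies gradient descent on {f_p - f(x)}^2.
    """
    if noise is None:
        # Small random vector; its scale is an assumption of this sketch.
        noise = [random.gauss(0.0, 0.01) for _ in centers[j_p]]
    x = [ui + ei for ui, ei in zip(centers[j_p], noise)]
    for _ in range(steps):
        err = f_p - rbf_output(x, centers, sigmas, weights)
        if err * err < tol:
            break
        grad = rbf_gradient(x, centers, sigmas, weights)
        # d/dx {f_p - f(x)}^2 = -2 * err * grad f(x)
        x = [xi + 2.0 * eta * err * gi for xi, gi in zip(x, grad)]
    return x
```

Note that the recovered vector reproduces the recorded output but need not equal the original sample: several inputs can share the same RBFNN output, and the random offset decides which of them the descent reaches.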


5.3  Incremental  Weighted  Least  Squares 


If the system receives the n-th new chunk, it creates a clone of the provisional best RBFNN, which was constructed in the previous rehearsal phase, as the new learner. Then, the new learner learns not only the new samples but also the old samples, which are recalled from the provisional best RBFNN. At first, m new hidden units are appended to the learner. Then, the system configures the hidden unit centers for the new hidden units as well as the old hidden units. This process is achieved by applying the weighted fuzzy k-means method (see Section 5.1) to the above hidden units using both the new learning samples and the pseudo-learning samples generated by the procedure described in Section 5.2.

After the configuration of the hidden unit centers, Φ_n is generated as follows.

\[ \Phi_n = \begin{bmatrix} \phi_1(\hat{x}_1) & \cdots & \phi_{m_n}(\hat{x}_1) \\ \phi_1(\hat{x}_2) & \cdots & \phi_{m_n}(\hat{x}_2) \\ \vdots & & \vdots \\ \phi_1(\hat{x}_{N_{n-1}}) & \cdots & \phi_{m_n}(\hat{x}_{N_{n-1}}) \\ \phi_1(x_1) & \cdots & \phi_{m_n}(x_1) \\ \vdots & & \vdots \\ \phi_1(x_{N_c}) & \cdots & \phi_{m_n}(x_{N_c}) \end{bmatrix}, \tag{15} \]

where x̂_p and x_n denote the p-th pseudo-sample vector generated by RBFNN(n - 1) and the n-th new sample in the new chunk. N_{n-1} denotes the total number of learned samples until the previous rehearsal phase, and N_c is the number of new learning samples in the current chunk. Note that N_n = N_{n-1} + N_c.

After the generation of Φ_n, the connection weights between the hidden units and the output unit are derived as follows.

\[ w_{ML} = (\Phi_n^T W_n \Phi_n)^{-1} \Phi_n^T W_n F_n, \tag{16} \]

where W_n and F_n are the weights and desired outputs of all samples, respectively. They are created by expanding the size of W_{n-1} and F_{n-1} and setting the corresponding weights and outputs for the new samples (see footnote 2).

Note that m, the number of added hidden units, is also optimized by using the information criterion described in the next section. The transition of the number of hidden units is similar to that of the incremental EM algorithm.

5.4 Determination of λ and Number of Hidden Units

The flattening parameter λ and the number of hidden units M must be determined properly to attain greater generalization capability. In this study, the information criterion ICW, which was proposed by Shimodaira (2000) [3], was used for determining λ and M. ICW is used to estimate the performance of a learning machine using the weighted error function under covariate shift. Therefore, the system searches for the pair (λ*, M*) for which the ICW value is the minimum. In the experiment, we prepared several sets of λ and m. They were applied to construct the new RBFNN, and the resulting RBFNNs were evaluated using ICW.

6  Ensemble  Prediction  of  Output 

The quality of the resultant RBFNN is highly affected by the accuracy of the predictive distribution, which in turn depends on the variation in the given learning samples. As a result, the performance of the resultant RBFNN is occasionally lower than that of the original RBFNN, which does not employ the weighted error function given by Eq.(1) [1, 2].

To overcome this drawback, the output of the proposed system is an ensemble of the outputs of the following two RBFNNs: (a) the RBFNN that learned the samples using the proposed method described in Sections 4 and 5, and (b) the original RBFNN, which learned the samples by the ordinary least squares (OLS) method. Therefore, the final output f(x) is given by

\[ f(x) = w_{WLS}\, f_{\theta_n}^{WLS}(x) + w_{OLS}\, f_{\theta_n}^{OLS}(x). \tag{17} \]

Note that f_{θ_n}^{OLS}(x) executes the incremental learning procedure in the same way as f_{θ_n}^{WLS}(x) except that it uses the normal objective function. In Eq.(17), w_WLS and w_OLS denote the weights for the two RBFNNs. Their values are 0.5 immediately after completion of the learning phase; however, they are sequentially modified according to the squared errors of the new samples introduced in the succeeding recording phase.

\[ w_{WLS} = \frac{\exp(-e_{WLS}/a)}{\exp(-e_{WLS}/a) + \exp(-e_{OLS}/a)}, \quad w_{OLS} = \frac{\exp(-e_{OLS}/a)}{\exp(-e_{WLS}/a) + \exp(-e_{OLS}/a)}, \tag{18} \]

where e_WLS and e_OLS are the mean square errors of the two RBFNNs. They vary according to the squared error for each new sample, e.g.,

\[ e_{WLS} := e_{WLS} + \frac{1}{N}\left[ \left\{ F(x_N) - f_{\theta_n}^{WLS}(x_N) \right\}^2 - e_{WLS} \right], \tag{19} \]

where N denotes the index of the new sample. Note that the RBFNN having the greater weight is the provisional best RBFNN for the succeeding rehearsal phase.

2 The weights for the old samples are reused in the subsequent learning phase for simplicity.
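The ensemble mechanism can be sketched as a running mean of squared errors per network combined through a softmax-like weighting. The class name, the parameter name `scale` (standing in for the constant inside the exponentials), and the running-mean form of the error update are assumptions of this sketch, since the printed equations are only partly legible.

```python
import math


class EnsembleWeights:
    """Tracks mean squared errors of two networks and mixes their outputs."""

    def __init__(self, scale=0.1):
        self.scale = scale                      # assumed constant in the exponent
        self.err = {"WLS": 0.0, "OLS": 0.0}     # running mean squared errors
        self.count = 0                          # number of new samples seen

    def update(self, target, out_wls, out_ols):
        """Update both running mean squared errors with one new sample."""
        self.count += 1
        for name, out in (("WLS", out_wls), ("OLS", out_ols)):
            sq = (target - out) ** 2
            self.err[name] += (sq - self.err[name]) / self.count

    def weights(self):
        """Return (w_WLS, w_OLS); both are 0.5 while the errors are equal."""
        # Subtracting the smaller error keeps the exponentials from underflowing.
        base = min(self.err.values())
        a = math.exp(-(self.err["WLS"] - base) / self.scale)
        b = math.exp(-(self.err["OLS"] - base) / self.scale)
        return a / (a + b), b / (a + b)

    def predict(self, out_wls, out_ols):
        """Weighted combination of the two network outputs."""
        w_wls, w_ols = self.weights()
        return w_wls * out_wls + w_ols * out_ols
```

A small `scale` makes the ensemble switch almost entirely to the network with the lower recent error, while a large value keeps the combination close to a plain average.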


7  Experiments 

The system was tested using one synthetic dataset and two benchmark test datasets. For convenience, the RBFNN that uses the weighted error function is denoted as "WRBFNN" in these experiments. The performance of WRBFNN was compared with that of the original RBFNN, which does not use the weighted error function. The original RBFNN is denoted as "org-RBFNN" hereinafter. Org-RBFNN is equivalent to the fundamental architecture of nearly all the former incremental learning systems that use RBFNN or perceptron [6, 14-16]. Note that org-RBFNN is the same as WRBFNN for λ = 0.

7.1 Illustrative Example in One-Dimensional Synthetic Dataset [2]

The following simple dataset was used to accurately evaluate the system behavior: (x, y) = (x, 1.5), where x ~ N(-20, 2) or N(20, 2).

Note that F(x) = y = 1.5. There were 101 learning samples. One isolated point (x, y) = (10, 1.5) was manually added to clearly demonstrate the effects of the weighting function given by Eq.(9). The system learned two chunks of the data sequentially. The first chunk consisted of 50 samples generated from N(-20, 2). The second one consisted of the isolated point (x, y) = (10, 1.5) and 50 samples generated from N(20, 2). The overlap factor, κ, was set to 2 for both WRBFNN and org-RBFNN. We compared the performances of WRBFNN and org-RBFNN. After the second rehearsal phase, the proposed system yielded the WRBFNN having four hidden units that learned the samples for λ = 1.

Fig. 3 shows the output curves for WRBFNN and org-RBFNN. We can see that the weight for each learning sample is < 1 when x is close to the edges. In particular, the weight for the isolated point is considerably greater than those for the other samples. This means that if the learning samples appear infrequently, the corresponding weight increases. However, org-RBFNN does not learn such samples well due to their low frequency of appearance. By using the proposed approach, WRBFNN can learn such samples better than org-RBFNN on account of the increase in the corresponding weights. Consequently, it can be observed from Fig. 3 that the output curve for WRBFNN fits the isolated point. Note that this test did not use the ensemble prediction method, in order to evaluate the effect of the importance weight clearly. Therefore, the proposed system interpreted the isolated sample as the prelude to a change in the input distribution, expecting that samples similar to the isolated one would be introduced in the near future.

Fig. 3. Output curves for WRBFNN and org-RBFNN


7.2  System  Behavior  with  Benchmark  Dataset 

To verify the validity of the proposed method, the performance of the system was examined on benchmark datasets for regression, i.e., Auto mpg, CPU performances and Servo from the University of California, Irvine (UCI) machine learning repository. The performance of the proposed system was compared with that of the previous system proposed in [2].

The parameters used in both systems were a = 0.1 (Eq.(18)) and κ = 2.5 (Eq.(12)). In this experiment, the variance-covariance matrix, Σ_i, for the Gaussian mixture model was restricted to a diagonal matrix for simplicity.

For each dataset, 50 learning samples were randomly selected for each chunk. The two systems repeated the rehearsal phase three times. All the samples were used as test samples. This test was repeated 50 times for different learning datasets. To prevent the learning process from becoming unstable, the weight for each sample was restricted to < 10.

In the case of the previous system, each result was plotted as a two-dimensional point, (x, y) = (MSE_WRBFNN, MSE_org-RBFNN), where MSE_* denotes the mean square error calculated by using all the samples in the dataset. Note that if WRBFNN outperforms org-RBFNN, the points are located above the line y = x. In the case of the proposed system, the result was plotted as (x, y) = (MSE_estimate, MSE_org-RBFNN), where MSE_estimate denotes the mean square error of the combined outputs defined in Eq.(17) after the introduction of new samples in the succeeding phase.

Fig. 4. System performances for Auto mpg, CPU performances and Servo after first and second rehearsal phases. Left: Previous system. Right: Proposed system.

Fig. 4 shows the responses of the two systems for the datasets after the first and second rehearsal phases. From this figure, we can see that the performance of the previous system is usually lower than that of org-RBFNN in the case of the CPU performances and Servo datasets. On the other hand, the proposed system achieves good performance for these datasets. This means that the proposed system adaptively chooses the better RBFNN according to the mean square error of the new samples.

8  Conclusion 

In this study, we attempted to develop an incremental learning system based on the predictive distribution of virtual-concept-drifting environments. The new approach was able to predict the input density of the new learning samples that were introduced in later incremental learning steps. This made the learning system undergo proactive learning according to the predicted input density. Therefore, the new incremental learning scheme reinforces the learning effect using novel isolated learning samples.

The proposed system is an improved version of the previous systems [1, 2]. The main difference between the previous systems and the proposed one is that the latter incorporates an ensemble prediction mechanism to obtain a more stable recognition ability. Experimental results demonstrated that the likelihood of learning failure with this system is reduced. The system, however, needs to adjust the connection weights for the ensemble using the new samples introduced in succeeding recording phases.

Abstract. Data visualisation can be a great support to the data mining process. We introduce a data structure that allows browsing through the data, giving a complete but very manageable overview of the entire data set, where the data is split into subsets and displayed from interesting angles to reveal the relevant patterns for each subset.

Based on the features originating from principal separation analysis, a tree is grown. A node of the tree is associated with a feature and a subset of instances, and later on with a two-dimensional visualisation. At the node level, groups of instances of different classes that can be displayed from a more interesting angle are temporarily grouped together in subsets. For each of these subsets child nodes are created that display this part of the data from a more interesting angle, revealing new patterns. This process is continued until no further improved visualisation can be found.

After the tree has been constructed, it can be used to easily browse through the data. The nodes correspond with two-dimensional visualisations of the data, but the specific properties of the tree allow for three-dimensional animated transitions from one node to another, further clarifying the patterns in the data.


1  Introduction 

Visualising data can already give a data miner a good idea of the structure of the data. The largest problem is that only two-dimensional images are easily interpretable for the human eye. Unfortunately, data sets tend to have far higher dimensionalities, so that a single two-dimensional image does not suffice. Therefore, multiple visualisations are needed, such as a scatterplot matrix [2]. But as each combination of two attributes is visualised here, the user is also rapidly overwhelmed by the information overflow, even if the number of attributes is low.

A solution is to show the data from the most interesting angles by using e.g. principal component analysis [7] or Fisher's linear discriminant analysis [5]. But this also has some severe limitations. By using only the first two axes, only a part of the structure of the data is visible. This may suffice for simple binary problems, but once the data is more complex or when there are more than two classes, one two-dimensional image does not suffice. In these cases the other axes also hold important information about the less predominant patterns. Creating more images using these other axes will bring little relief, as the other instances will obscure the pattern for the instances for which the patterns are relevant.

Therefore we introduce a classification tree, the eigen transformation classification tree, whose function is not to classify, but to hold views on parts of the data from the most important angles. The angles are also derived by an eigen transformation. A big difference with other techniques that view data from multiple interesting angles, such as The Grand Tour [1], is that only those subsets of the data for which the current view is relevant are displayed. The structure of the tree also allows moving smoothly from one visualisation to another, further simplifying the interpretation for the data miner [6].

2  Principal  Separation  Analysis 

The eigen transformation we use is called principal separation analysis (PSA), which was introduced in [8], but only for two classes. We extend the principle here to multiple classes. Consider that our global dataset X ⊂ R^d is partitioned into the classes P ∈ 𝒫. We are only interested in considering interclass vectors and distances. So define the interclass differences set D(𝒫), consisting of all d-vectors p - q for any pair of d-vectors p ∈ P and q ∈ Q for P ≠ Q ∈ 𝒫. This may also be written as follows:

\[ D(\mathcal{P}) = \left\{\, p - q \;\middle|\; (p, q) \in (X \times X) \setminus \bigcup_{P \in \mathcal{P}} (P \times P) \,\right\} \]

Using the notation of [8], we then define the reduction matrix for multiclass principal separation components as

\[ R = \mathrm{Eig}(\mathrm{Mom}(D(\mathcal{P}))) \]

We prefer PSA to principal component analysis and Fisher's linear discriminant analysis because, contrary to principal component analysis, it takes the classes into account, and contrary to Fisher's linear discriminant analysis, it does not run into numerical problems if there are linear dependencies.
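To make the construction concrete, the sketch below builds the set of difference vectors between instances of different classes, forms a second-moment matrix from them, and extracts its leading eigenvector by power iteration. The exact definitions of Mom and Eig are given in [8] and are not reproduced here, so treating Mom as the average outer product and computing only the first eigenvector are assumptions of this sketch; all function names are hypothetical.

```python
import math


def interclass_differences(classes):
    """All vectors p - q with p and q taken from two different classes.

    classes is a list of classes, each a list of equally sized vectors.
    Both p - q and q - p are included, as in the set definition.
    """
    diffs = []
    for a, cls_a in enumerate(classes):
        for b, cls_b in enumerate(classes):
            if a == b:
                continue
            for p in cls_a:
                for q in cls_b:
                    diffs.append([pi - qi for pi, qi in zip(p, q)])
    return diffs


def moment_matrix(vectors):
    """Average outer product of the vectors (assumed meaning of Mom)."""
    d = len(vectors[0])
    m = [[0.0] * d for _ in range(d)]
    for v in vectors:
        for i in range(d):
            for j in range(d):
                m[i][j] += v[i] * v[j]
    n = float(len(vectors))
    return [[val / n for val in row] for row in m]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the eigenvector with the largest eigenvalue."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v
```

For two classes separated along the first attribute, the leading direction aligns with that attribute, which is the kind of separating pattern the largest eigenvalue is expected to reveal. A full implementation would use a complete symmetric eigendecomposition to obtain all components sorted by eigenvalue.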


3  Eigen  Transformation  Classification  Trees 

The  PSA  will  yield  eigen  vectors  that  allow  us  to  identify  important  patterns. 
The  eigen  vectors  with  the  largest  eigen  values  will  reveal  the  most  significant 
patterns  relevant  for  the  entire  data  set,  while  the  eigen  vectors  with  smaller 
eigen values will indicate patterns that are only relevant for smaller subsets.

We define an eigen transformation classification tree (ETCT) that will work in a feature space with features based on the PSA matrix, based on some preliminary work that resulted in a classifier called a 2-class eigen transformation classification tree (2C-ETCT) [3]. The tree will subsequently split the space along the features following the order indicated by the eigen values. It will check for neighbouring groups whether the pattern at the current level is the most significant to separate them, which will result in a split at that level, or if there exists a more specific pattern further down, which will result in keeping the groups together at the current moment. The tree will finally be evaluated to prune away those parts that fit the specific instances of the data but not necessarily the patterns in the data.

The ETCT can then be used to browse through the data. 2D views are generated based on the feature corresponding with the level of the node and its parent node, yielding the most interesting view for the subset. The splits can also be displayed on the same view, creating zones which the user can select to investigate the corresponding subset from a more interesting angle. The structure of the tree also makes it possible to move from one view to another seamlessly through 3D animations.

4  Creating  the  Tree 

4.1  Data  Model 

For  each  node  the  level,  the  expected  class  and  the  splits  are  stored.  The  level 
indicates  the  corresponding  feature,  i.e.  the  feature  derived  from  the  eigenvector, 
at  the  index  equal  to  the  level,  of  the  PSA  matrix  sorted  by  the  eigenvalues. 
The  expected  class  is  the  class  to  which  an  instance  reaching  this  node  most 
likely  belongs.  The  splits  are  the  values  to  which  the  feature  of  an  instance 
corresponding  with  the  level  is  compared  to  decide  to  which  child  node  it  should 
be  sent. 

Each node also has ten folds, which contain the information that is used to prune the tree. Each fold corresponds with one pair of a training set and a test set. They store the expected class and the splits based on the training set, and the number of correctly and incorrectly classified instances of the test set based on the results on the training set.

A node also has a class transform structure, which is used to merge different classes of the classification problem together at the level of a node. This is done to merge classes temporarily if the instances belonging to these classes can be better split deeper in the tree. A global class of the classification problem will correspond with a local class on node level. Many global classes can map to the same local class.

4.2  Algorithm 

The creation of the tree starts with the computation of the PSA matrix based on the data points passed as a parameter. During the next step the instances are transformed using the matrix, yielding instances with new features. Then the top node is created and a grow message is sent to it with all the transformed instances as a parameter. A first pruning takes place before the tree is evaluated. Then the transformed data set is used to create training and test folds based on two 5-fold cross-validations. Each pair of training and test sets is used to send an evaluation message to the top node. The information generated during the evaluation is then used for a final pruning.

Create the PSA transformation matrix using all data points
transformed data set ← data set × transformation matrix
Create the top node
top.Grow(transformed data set)
top.PruneBeforeEvaluation
Create the internal training and test folds from the transformed data set
For each pair of training and test sets do
    top.Evaluate(fold index, training set, test set)
End for
top.PruneAfterEvaluation

5  Browsing  the  Tree 

After the ETCT is built, it can be used to visually browse through the data set. A 2D view is generated based on the information of a node and the instances that reached that node after applying the splits in the nodes above. The user can move to a node below by left clicking the corresponding zone on the current 2D view. This will trigger a 3D animation that will zoom in on the selected zone, add information of the selected node and rotate the information of the previous node away, similar to the technique used in [4]. By right clicking the visualisation, the user moves up a node in a similar fashion, where information of the newly selected node is rotated in and information of the previous node rotated out, followed by a zoom out.

6  Example 

Advantages of the ETCT such as (1) limiting the number of views on the data,
(2) finding the more interesting angles and (3) the possibility to move from one
view to another through an animation are direct consequences of the properties
of the structure of the ETCT. The final advantage of the ETCT over other
visualization techniques is that (4) smaller patterns are made visible, as only the
instances to which the pattern applies are shown. To illustrate this, we use an
artificial data set with 1000 instances, 7 classes and 4 attributes. Each group
is largely linearly separable from each other group along one of the attributes,
except for the green and cyan groups, which overlap for all attributes. The
combinations of groups and the attribute that separates them are the patterns we
are looking for. Moreover, some patterns are more significant than others. Forty
percent of the instances belong to the red class, and can be separated from all
other groups using the first attribute. The other groups are equal in size, but the
second attribute allows separating the violet and indigo classes from the blue,
cyan, green and yellow classes, as well as separating the yellow class from the
violet, indigo, blue, cyan and green classes. The other attributes only separate
two of the smaller groups. This data set also nullifies the other advantages of
the ETCT, as the number of attributes is limited (relevant for advantage (1)),
none are redundant (relevant for advantage (2)) and each pattern is expressed
uniquely by one attribute (relevant for advantage (2)), thereby only illustrating
the ability to reveal smaller patterns (advantage (4)). For all these reasons, the
corresponding scatterplot matrix shown in Figure 1 is very manageable, complete,
and shows the data from the most interesting angles. Therefore, no other
technique that uses the full data can yield better results. When we evaluate the
visualisation techniques based on the visibility of the patterns, we observe that only
the scatterplots in the upper left reveal clear patterns, while additional
information is hard to discern in the scatterplots located more towards the lower right.

Fig. 1. Top: Scatterplot Matrix; Bottom: Eigen Transformation Classification Tree

Figure 1 also shows the ETCT of the same data set, where the nodes are
represented by their respective 2D views. By removing the instances for which
the orientation is not relevant, the smaller patterns are clearly visible without any
redundant views. An animated depth first exploration of this tree can be found
at http://homepages.vub.ac.be/~sdebruyn/etct/etct.avi (format: XVid),
illustrating both the moving down and moving up animations. The animations
link the more informative 2D views, making the data even more understandable
for the user.

Abstract. Computational Tree Logic (CTL) model update is an approach
to software verification and modification, where minimal change
is employed to generate updated models that represent the corrected
software design. In this paper, we propose a new update principle named
minimal change with maximal reachable states (II), which is a further
optimisation of an existing algorithm to solve a model explosion problem
during CTL model update. We provide a comparison of the two methods
based on Graph Theory. The algorithm of this update principle is
also provided. Our experimental results show that in the case of updating
the Andrew File System protocol model, the new CTL update approach
significantly narrows down the committed models to fewer strong
committed models.

Keywords:  model  checking,  model  update,  minimal  change. 


1  Introduction 

Error repairing is a formal-based approach that complements model checking.
An example is the application of AI techniques to model checking and error
repairing [1]. We have recently developed a software error repairing technique [6]
for updating models expressed using CTL notation. The technique, referred to
as CTL model update, is supported by a prototype algorithm and has been
applied to several examples [5]. The methodology of model update unifies model
checking and modification and can closely retain the efficiency of model checking
as well as being able to develop a systematic approach for system modification.
The CTL model update algorithms described in earlier papers typically generate
multiple solutions, some less appropriate than others: we shall refer to this as
the model explosion problem. The challenge is to minimise the number of
non-optimal solutions to improve the efficiency of the model updater.

2  CTL Model Update: An Overview

2.1  CTL  Syntax  and  Semantics 

Definition 1. [3] Let AP be a set of atomic propositions. A Kripke model M
over AP is a three tuple M = (S, R, L) where: 1. S is a finite set of states, 2.
R ⊆ S × S is a transition relation, 3. L : S → 2^AP is a function that assigns
each state with a set of atomic propositions.

Definition 2. [7] Computation tree logic (CTL) has the syntax given in
Backus-Naur form: φ ::= ⊤ | ⊥ | p | (¬φ) | (φ_1 ∧ φ_2) | (φ_1 ∨ φ_2) | φ_1 → φ_2 | AXφ | EXφ | AGφ
| EGφ | AFφ | EFφ | A[φ_1 U φ_2] | E[φ_1 U φ_2] where p is any propositional atom.

Definition 3. [7] M = (S, R, L) is a Kripke model for CTL and s ∈ S. A CTL
formula φ holding in state s is denoted by (M, s) ⊨ φ. The satisfaction relation
⊨ is defined by structural induction on CTL formulae:

1. (M, s) ⊨ p iff p ∈ L(s).

2. (M, s) ⊨ ¬φ iff (M, s) ⊭ φ.

3. (M, s) ⊨ φ_1 ∧ φ_2 iff (M, s) ⊨ φ_1 and (M, s) ⊨ φ_2.

4. (M, s) ⊨ φ_1 ∨ φ_2 iff (M, s) ⊨ φ_1 or (M, s) ⊨ φ_2.

5. (M, s) ⊨ φ_1 → φ_2 iff (M, s) ⊭ φ_1, or (M, s) ⊨ φ_2.

6. (M, s) ⊨ AXφ iff for all s_1 such that (s, s_1) ∈ R, (M, s_1) ⊨ φ.

7. (M, s) ⊨ AGφ iff for all paths π = [s_0, s_1, s_2, ...], where s_0 = s and ∀s_i
∈ π, (M, s_i) ⊨ φ.

8. (M, s) ⊨ A[φ_1 U φ_2] iff for all paths π = [s_0, s_1, s_2, ...], where s_0 = s, ∃s_i ∈ π,
(M, s_i) ⊨ φ_2, and for each j < i, (M, s_j) ⊨ φ_1.

A CTL formula φ is evaluated on a Kripke model M and is satisfiable. A path
in M from a state s is an infinite sequence of states

π =def [s_0, s_1, ..., s_i, s_{i+1}, ...]

such that s_0 = s, (s_i, s_{i+1}) ∈ R holds for all i ≥ 0, (s_i, s_{i+1}) ⊆ π and s_i ∈ π.
We denote s_i < s_j if s_i is a state earlier than s_j in π. We denote state s' as
succ(s) if there is a relation (s, s') in R. s' could be one of a set of successor
states of s. If succ(s) ⊭ φ, we express it as succ(s, ¬φ). If a state is accessible
by transitions from an initial state s_0, it is called a reachable state. We use
RS(M) = RS(M, s_0) to denote the set of all reachable states from s_0 in M.
Similarly, we use RS_i(M) = RS(M, s_i) to denote the set of reachable states
from any state s_i in M. The unchanged reachable states mean that the reachable
states in an updated model are also in the original model. A state s is called a
true state for φ if s ⊨ φ and called a false state for φ if s ⊭ φ.
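
As an illustration of the notions above, a Kripke model and its reachable-state set RS(M, s) can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the class name `Kripke`, the helper `reachable_states`, the example states and labels, and the choice to count the start state itself as reachable are all assumptions made for the example.

```python
from collections import deque

class Kripke:
    """M = (S, R, L): states, transition relation, labelling function."""

    def __init__(self, states, relation, labels):
        self.S = set(states)
        self.R = set(relation)   # set of (s, s') pairs
        self.L = dict(labels)    # state -> set of atomic propositions

    def succ(self, s):
        """All successor states s' with (s, s') in R."""
        return {t for (u, t) in self.R if u == s}

    def holds(self, s, p):
        """(M, s) |= p for an atomic proposition p, i.e. p in L(s)."""
        return p in self.L[s]

def reachable_states(model, start):
    """RS(M, start), computed by breadth-first search over R.

    Assumption: the start state is itself counted as reachable.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in model.succ(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

# Hypothetical four-state model; s3 is not reachable from s0.
M = Kripke(
    states={"s0", "s1", "s2", "s3"},
    relation={("s0", "s1"), ("s1", "s2"), ("s2", "s1"), ("s3", "s0")},
    labels={"s0": {"p"}, "s1": {"p", "q"}, "s2": {"q"}, "s3": set()},
)
rs = reachable_states(M, "s0")
```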

2.2  Minimal  Change  for  CTL  Model  Update 

Definition 4. [6] (CTL Model Update) Given a CTL Kripke model M =
(S, R, L) and a CTL formula φ such that ℳ = (M, s_0) ⊭ φ, where s_0 ∈ S.
Update(ℳ, φ) derived from M to satisfy φ results in an updated model M' =
(S', R', L') such that ℳ' = (M', s_0') ⊨ φ where s_0' ∈ S'.

Update is achieved by applying a combination of primitive update operations
PU1 to PU5. Given M = (S, R, L), its updated model M' = (S', R', L') is:

PU1: Adding a relation.
S' = S, L' = L, and R' = R ∪ {(s_ar, s_ar2)} where (s_ar, s_ar2) ∉ R for s_ar, s_ar2 ∈ S.

PU2: Removing a relation.
S' = S, L' = L, and R' = R − {(s_rr, s_rr2)} where (s_rr, s_rr2) ∈ R for s_rr, s_rr2 ∈ S.

PU3: Substituting a state and its associated relation(s).
S' = S[s/s_ss], R' = R ∪ {(s_i, s_ss), (s_ss, s_j) | (s_i, s), (s, s_j) ∈ R} − {(s_i, s), (s, s_j) |
(s_i, s), (s, s_j) ∈ R}, and for all s ∈ S ∩ S', L'(s) = L(s), and L'(s_ss) is a set of
true variables assigned in s_ss.

PU4: Adding a state and its associated relation(s).
S' = S ∪ {s_as}, R' = R ∪ {(s_i, s_as), (s_as, s_j) | for some s_i, s_j ∈ S'}, and for all
s ∈ S ∩ S', L'(s) = L(s), and L'(s_as) is a set of true variables assigned in s_as.

PU5: Removing a state and its associated relation(s).
S' = S − {s_rs | s_rs ∈ S}, R' = R − {(s_i, s_rs), (s_rs, s_j) | for some s_i, s_j ∈ S}, and
L'(s) = L(s) for all s ∈ S ∩ S'.
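
The five primitive operations can be illustrated on a plain set-based encoding of a model. The sketch below is only one possible reading of PU1 to PU5: the function names, the representation of M as a triple of Python sets and a dictionary, and the explicit `incoming`/`outgoing` arguments of PU4 are assumptions introduced for the example. Every function returns a new triple (S', R', L') and leaves its arguments untouched.

```python
def pu1_add_relation(S, R, L, src, dst):
    """PU1: add the transition (src, dst); S and L are unchanged."""
    assert src in S and dst in S and (src, dst) not in R
    return set(S), set(R) | {(src, dst)}, dict(L)

def pu2_remove_relation(S, R, L, src, dst):
    """PU2: remove the transition (src, dst); S and L are unchanged."""
    assert (src, dst) in R
    return set(S), set(R) - {(src, dst)}, dict(L)

def pu3_substitute_state(S, R, L, old, new, new_label):
    """PU3: replace state old by new and redirect its transitions."""
    assert old in S and new not in S
    rename = lambda s: new if s == old else s
    S2 = {rename(s) for s in S}
    R2 = {(rename(a), rename(b)) for (a, b) in R}
    L2 = {s: lab for s, lab in L.items() if s != old}
    L2[new] = set(new_label)  # true variables assigned in the new state
    return S2, R2, L2

def pu4_add_state(S, R, L, new, new_label, incoming, outgoing):
    """PU4: add state new with transitions from `incoming` and to `outgoing`."""
    assert new not in S
    R2 = set(R) | {(s, new) for s in incoming} | {(new, s) for s in outgoing}
    L2 = dict(L)
    L2[new] = set(new_label)
    return set(S) | {new}, R2, L2

def pu5_remove_state(S, R, L, s):
    """PU5: remove state s together with all transitions touching it."""
    assert s in S
    R2 = {(a, b) for (a, b) in R if a != s and b != s}
    L2 = {t: lab for t, lab in L.items() if t != s}
    return set(S) - {s}, R2, L2

# Hypothetical three-state model used to exercise the operations.
S = {"s0", "s1", "s2"}
R = {("s0", "s1"), ("s1", "s2"), ("s2", "s2")}
L = {"s0": {"p"}, "s1": {"p"}, "s2": set()}

S3, R3, L3 = pu3_substitute_state(S, R, L, "s2", "s2*", {"p"})
S5, R5, L5 = pu5_remove_state(S, R, L, "s2")
```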

Given models M = (S, R, L) and M' = (S', R', L'), where M' is an updated
model from M by only applying operation PUi on M. We define Diff_PUi(M, M')
= Diff(R, R') (i = 1, 2), Diff_PUi(M, M') = Diff(S, S') (i = 3, 4, 5)
and … For PU3 to update s ⊭ φ to s* ⊨ φ, we say Diff(s, s*) is minimal
if we cannot find Diff(s, s'') ⊂ Diff(s, s*), where s'' ⊨ φ.


Definition 5. (Closeness Ordering) Given three CTL Kripke models M, M_1
and M_2, where M_1 and M_2 are obtained from M by applying PU1-PU5
operations, we say that M_1 is closer or as close to M as M_2, denoted as M_1 ≤_M M_2,
iff Diff(M, M_1) ⊆ Diff(M, M_2). We denote M_1 <_M M_2 if M_1 ≤_M M_2 and
M_2 ≰_M M_1.


Definition 6. (Admissible Update) Given a CTL Kripke model M = (S, R, L),
ℳ = (M, s_0), where s_0 ∈ S, and a CTL formula φ, Update(ℳ, φ) is called
admissible if: (1) Update(ℳ, φ) = (M', s_0') ⊨ φ where M' = (S', R', L') and
s_0' ∈ S'; and (2) there does not exist another resulting model M'' = (S'', R'', L'')
and s_0'' ∈ S'' such that (M'', s_0'') ⊨ φ and M'' <_M M'.

Theorem 1. M = (S, R, L) is a Kripke model, ℳ = (M, s_0) and ℳ ⊭ AGφ,
where s_0 ∈ S and φ is a propositional formula. An admissible model ℳ' =
Update(ℳ, AGφ) can be obtained by the following: for each path π = [s_0, ..., s_i, ...]:

1. if for all s < s_i in π, s ⊨ φ but s_i ⊭ φ, PU2 is applied to s_i to remove
relation (s_{i-1}, s_i), or PU5 is applied to s_i to remove s_i and its associated
relations, or

2. PU3 is applied to all states s in π not satisfying φ to substitute s with s* ⊨ φ
and Diff(s, s*) is minimal.

Definition 7. (Minimal change with maximal reachable states (I)) [4,5]
Given a CTL Kripke model M = (S, R, L), ℳ = (M, s_0) where s_0 ∈ S, and a
CTL formula φ, Update(ℳ, φ) is called committed if the following conditions
hold: (1) Update(ℳ, φ) = ℳ' = (M', s_0') is admissible; and (2) there does not
exist another resulting model ℳ'' = (M'', s_0'') such that ℳ'' is admissible and
RS(M) ∩ RS(M') ⊂ RS(M) ∩ RS(M'').


3  A  Further  Optimisation 


3.1  A  New  Approach:  Reachable  States  from  One  State  to  Another 

We  consider  an  improvement  of  the  reachable  state  principle  in  Definition  7. 
If  two  states  are  preserved  in  an  update  and  there  was  a  path  between  them 
in  the  original  model,  then  there  is  still  a  path  between  them  in  the  updated 
model.  This  improved  reachable  state  principle  in  fact  provides  the  reachability 
condition  from  all  unchanged  states  in  a  model  rather  than  only  from  initial 
states  as  described  in  Definition  7. 


Definition 8. (Minimal change with maximal reachable states (II))
Given a CTL Kripke model M = (S, R, L), ℳ = (M, s_0) where s_0 ∈ S, and a
CTL formula φ, Update(ℳ, φ) is called strong committed if the following
conditions hold: (1) Update(ℳ, φ) = ℳ' = (M', s_0') is admissible or committed; and
(2) there does not exist another resulting model ℳ'' = (M'', s_0'') such that ℳ''
is admissible or committed and RS_i(M) ∩ RS_i(M') ⊂ RS_i(M) ∩ RS_i(M'').


The strong committed update preserves all unchanged reachable states in an
original model and preserves the reachability from any unchanged state to
another after an update. The strong committed model results from the strong
committed update. The total set of strong committed models is a subset of the
total set of committed models. Thus, a constraint for deriving strong committed
update instead of that for deriving committed update is added to Theorem 1.
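
The difference between the two reachable state principles can be made concrete with a small sketch. It assumes a model is given only by its transition relation, and the helper names `preserved_from_initial` and `preserved_from_each` are hypothetical. The first computes RS(M) ∩ RS(M') as compared in Definition 7; the second computes RS_i(M) ∩ RS_i(M') for every unchanged state s_i, as compared in Definition 8.

```python
def reachable(R, start):
    """States reachable from start via R (start itself included, by assumption)."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for (a, b) in R:
            if a == s and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def preserved_from_initial(R, R_updated, s0):
    """RS(M) ∩ RS(M'): reachable states preserved from the initial state."""
    return reachable(R, s0) & reachable(R_updated, s0)

def preserved_from_each(R, R_updated, unchanged):
    """RS_i(M) ∩ RS_i(M') for every unchanged state s_i."""
    return {s: reachable(R, s) & reachable(R_updated, s) for s in unchanged}

# Hypothetical example: the update removes the transition (s1, s2).
R = {("s0", "s1"), ("s0", "s2"), ("s1", "s2"), ("s2", "s2")}
R_updated = R - {("s1", "s2")}

from_initial = preserved_from_initial(R, R_updated, "s0")
from_each = preserved_from_each(R, R_updated, {"s0", "s1", "s2"})
```

In this example every state is still reachable from s0, so the comparison of Definition 7 sees no loss, whereas the per-state sets show that s2 is no longer reachable from s1.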

3.2  Comparing  Reachable  State  Principles  Using  Graph  Theory 


The reachable state principles in Definitions 7 and 8 can be further analysed from
a structural view using graph theory [2].

Let the original model be a graph G = (X, Γ), where X is the set of vertices
and Γ is a mapping of the set X in X which shows how the vertices relate to
each other. Its subgraph is G_s = (X_s, Γ_s) with X_s ⊆ X; and for every x_i ∈ X_s,
Γ_s(x_i) = Γ(x_i) ∩ X_s. Thus, a subgraph has only a subset X_s of the set of vertices
of the original graph but contains all the arcs whose initial and final vertices are
both within this subset. The reachability matrix R = [r_ij] is defined as follows:

r_{ij} = \begin{cases} 1 & \text{if vertex } x_j \text{ is reachable from vertex } x_i \\ 0 & \text{otherwise} \end{cases}

The reachable set of vertices from vertex x_i is:

R(x_i) = {x_i} ∪ Γ(x_i) ∪ Γ^2(x_i) ∪ ··· ∪ Γ^p(x_i),

where p is the cardinality of the reachable path from x_i.

Given two matrices A = [a_ij] and B = [b_kl], where i ≥ k, j = l, if for any two
elements a and b in identical positions of the two matrices a ≥ b holds, then we
say A ≥ B.

A Kripke model M = (S, R, L) is mapped into a graph M = (S, R), where S is
the set of vertices and R is the set of edges. After model update, M' = (S', R'),
where S' = S_unchange ∪ S_update. S_unchange is a set of unchanged states such that
S_unchange ⊆ S. S_update is a set of updated states and S_update ⊆ S'. Before update,
a subgraph of M containing all unchanged states S_unchange and the set of states
being updated with PU3 operation only, S_PU3, is G = (S_unchange ∪ S_PU3, R_u),
where S_PU3 ⊆ S and R_u ⊆ R. After update, a subgraph of M' containing all
unchanged states S_unchange and the set of states derived from update by using
PU3 operation only, S'_PU3, is G' = (S_unchange ∪ S'_PU3, R'_u), where S'_PU3 ⊆ S_update
and R'_u ⊆ R'. From a graph theory view, the set of vertices S_PU3 = S'_PU3 (see footnote 1).

For the subgraph G, the reachability matrix is RE = [r_ij], where i and j range
over the number of states in G. We use RE_Initial, where initial is
the number of any initial states, to denote the reachability after update described
in Definition 7. In Definition 7, reachability is only checked from initial states
corresponding to roots in graph theory. Also, i ≥ initial. Thus, RE ≥ RE_Initial.

After update optimised by using Definition 8, we use RE_AnyTwo, where
AnyTwo is the number of any unchanged states and the number of updated
states derived from using PU3 operation. It is obvious that RE = RE_AnyTwo
under the definition of reachability matrices in [2], if there is not any unchanged
reachable state lost during the update. Therefore, RE_AnyTwo ≥ RE_Initial. This
proves that minimal change constrained with Definition 8 retains more
unchanged reachable states than that of Definition 7 does during an update.
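
A reachability matrix of the kind used in this comparison can be computed with Warshall's transitive-closure algorithm. The sketch below is an illustration under assumptions: the function names are hypothetical, the diagonal is set to 1 because R(x_i) contains x_i itself, and `dominates` compares two matrices of equal size position by position.

```python
def reachability_matrix(vertices, edges):
    """r[i][j] = 1 if vertices[j] is reachable from vertices[i], else 0."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    # Every vertex reaches itself, following R(x_i) = {x_i} ∪ Γ(x_i) ∪ ...
    r = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        r[index[a]][index[b]] = 1
    # Warshall's algorithm: allow vertex k as an intermediate step.
    for k in range(n):
        for i in range(n):
            if r[i][k]:
                for j in range(n):
                    if r[k][j]:
                        r[i][j] = 1
    return r

def dominates(a, b):
    """A >= B: every entry of a is at least the entry of b at the same position."""
    return all(x >= y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

# Hypothetical subgraph before and after an update that removes (s1, s2).
vertices = ["s0", "s1", "s2"]
re_before = reachability_matrix(vertices, {("s0", "s1"), ("s0", "s2"), ("s1", "s2")})
re_after = reachability_matrix(vertices, {("s0", "s1"), ("s0", "s2")})
```

Here the row of the initial state s0 is identical before and after the update, while the full matrices differ in the entry for the pair (s1, s2); only a comparison over all rows detects the lost reachability.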

3.3  An  Improved  Algorithm 

We have developed an alternative algorithm satisfying Theorem 1 and
constrained with Definition 8 to derive strong committed models for optimising
AG update. A Kripke model is M = (S, R, L) and ℳ = (M, s), where s ∈ S.
ℳ is required to satisfy a propositional formula φ. The updated model of M
is M' = (S', R', L') and ℳ' = (M', s). RE and RE_AnyTwo are as described in
Section 3.2.

Update_AG(ℳ, φ)   /* ℳ ⊭ AGφ. Update ℳ to satisfy AGφ. */
{ if ℳ_0 = (M, s_0) ⊭ φ, then PU3 is applied to s_0; else
  (1) applying PU3 on all s_i ⊭ φ in ℳ;
  (2) select a path π = [s_0, s_1, ...], where ∃s_i ∈ π such that ℳ_i = (M, s_i) ⊭ φ;
      select the earliest state s_i ∈ π such that (M, s_i) ⊭ φ;
      perform one of the following three operations:
      (2.1) applying PU2 to remove relation (s_{i-1}, s_i), or
      (2.2) applying PU5 to remove state s_i and its associated relations,
            obtain result ℳ' only if RE = RE_AnyTwo, else
      (2.3) applying PU3 on all s_i ⊭ φ in π;
  if ℳ' ⊨ AGφ, return ℳ'; else return {Update_AG(ℳ', φ)}
}

1 The states before and after update using PU3 are supposed to be the same because
we do not consider the variables in states from the graph theory view.

594  Y. Ding and D. Hemer

After an update, the updated model is repeatedly checked whether it satisfies
the required property AGφ. If it does not, the function Update_AG recursively
calls itself until the updated model satisfies the specification property. The final
updated model is a strong committed model.
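
To make the control flow tangible, the following sketch implements only a heavily simplified variant of the procedure: it covers branch (2.1), cutting the transition that enters the earliest violating state on each path with PU2. It does not perform the RE = RE_AnyTwo check, does not apply PU3 or PU5, and makes no claim of minimal change. The function names and the encoding of φ as a Python predicate `holds` are assumptions for the example.

```python
def reachable(R, start):
    """States reachable from start via R (start itself included)."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for (a, b) in R:
            if a == s and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def satisfies_ag(R, s0, holds):
    """(M, s0) |= AG phi for propositional phi: phi holds in every reachable state."""
    return all(holds(s) for s in reachable(R, s0))

def update_ag_with_pu2(R, s0, holds):
    """Return R' obtained by removing each transition (s_{i-1}, s_i) that leads
    from a reachable satisfying state into a violating state (PU2 only)."""
    if not holds(s0):
        # The paper applies PU3 to s0 in this case; not modelled in this sketch.
        raise ValueError("initial state violates the property")
    updated = set(R)
    seen, stack = {s0}, [s0]
    while stack:
        s = stack.pop()
        for (a, b) in R:
            if a != s:
                continue
            if not holds(b):
                updated.discard((a, b))  # PU2 on the edge into the violating state
            elif b not in seen:
                seen.add(b)
                stack.append(b)
    return updated

# Hypothetical model in which s2 violates phi = p.
L = {"s0": {"p"}, "s1": {"p"}, "s2": set(), "s3": {"p"}}
R = {("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("s1", "s3"), ("s3", "s3")}
holds_p = lambda s: "p" in L[s]

R_new = update_ag_with_pu2(R, "s0", holds_p)
```

A single pass suffices in this restricted setting, because after the cut no violating state remains reachable from s_0; the recursive re-check in the algorithm above is needed once PU3 and PU5 are involved.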

The algorithm has been applied to Andrew File System 1 [5,8]. The number of
strong committed models for this case is 125, which is narrowed down from 225
committed models. The whole process of the reachable state algorithm has been
simulated in C code in our model updater prototype. Our model updater prototype
automatically performs the algorithm and the outputs are strong committed models.

For other CTL formula updates such as AX and AU, which have the possibility
of losing reachable states, the update algorithms are also constrained by
Definition 8 in a similar format as that of AG. We have implemented the reachable
state algorithm in code to embed the algorithm into the model updater protocol.

Abstract. Relational search is a novel paradigm of search which focuses on the
similarity between semantic relations. Given three words (A, B, C) as the query, a
relational search engine retrieves a ranked list of words 𝒟, where a word D ∈ 𝒟
is assigned a high rank if the relation between A and B is highly similar to that
between C and D. However, if C and D have numerous co-occurrences, then D is
retrieved by existing relational search engines irrespective of the relation between
A and B. To overcome this problem, we exploit the symmetry in relational
similarity to rank the result set 𝒟. To evaluate the proposed ranking method, we use
a benchmark dataset of Scholastic Aptitude Test (SAT) word analogy questions.
Our experiments show that the proposed ranking method improves the accuracy
in answering SAT word analogy questions, thereby demonstrating its usefulness
in practical applications.

Keywords:  relational  search,  relational  similarity,  symmetry. 


1  Introduction 

Relational search is a novel search paradigm based on relational similarity of word
pairs. For the query {(A, B), (C, ?)}, in which A, B, and C are input words, a relational
search engine finds the words D such that the relation between A and B is also held
between C and D. A candidate answer D is assigned a high rank when the word pair (C,
D) has a high degree of relational similarity with the word pair (A, B). In previous
methods for relational search [3] and relational similarity measure [1], the relation between
two words in a word pair is represented by lexico-syntactic patterns that frequently
co-occur with those words. However, this approach imposes a bias towards the frequency
of a word: a high frequency word D has a higher probability of being assigned a top
rank, irrespective of the semantic relation shared between (A, B) and (C, D). We
propose a ranking method which uses the symmetry in relational similarity to alleviate this
phenomenon.

To demonstrate the proposed ranking method, let us consider the query {(Google,
Eric Schmidt), (Microsoft, ?)}. Here, ? denotes an entity. Steve Ballmer is expected
to be ranked at the top of the result list for this query because Steve Ballmer is the CEO
of Microsoft, whereas Eric Schmidt is the CEO of Google. Moreover, when we use the
inverse query {(Eric Schmidt, Google), (?, Microsoft)}, Steve Ballmer is also expected
to be ranked as the first result. This is because relational similarity is invariant if both
word pairs are inverted [4]. The invariance of relational similarity under a symmetric
transformation of word pairs provides us with a practical method to rank candidates in
a relational search engine: we can obtain a better ranking if we take into account the
ranking in the inverse query's result list.

In addition, we propose "complementary rank" for improving the precision in
ranking the result set of a relational search query. When D is assigned a high rank (i.e., top
rank) in the query {(A, B), (C, ?)}, we can expect that C is also assigned a high rank
in the query {(A, B), (?, D)}. Therefore, we can consider the rank of C in the query
{(A, B), (?, D)} as an additional criterion for ranking D in the query {(A, B), (C, ?)}.
We call this additional criterion the "complementary rank of D". In the proposed
method, we combine the symmetric property and complementary rank to improve the
initial ranking.


2  Related  Work 

The idea of relational search has been introduced in Veale [6] and Bollegala, et al. [1].
Kato, et al. first implemented relational search [3] by issuing queries to a keyword-
based Web search engine. To extract candidate answers, they first query a Web search
engine for terms or lexico-syntactic patterns that are likely to appear only in documents
which contain both A and B. The extracted term or pattern set T is supposed to contain
terms or lexical patterns that express the relations between A and B. Then, they use C
and a term t ∈ T to find documents that contain both C and t. The candidate answer set
𝒟 is then defined as the set of terms that are likely to appear only in those documents.
Then, they rank the candidate set using the likelihood of co-occurrence of the term D
with the pair (C, t). Our method also uses lexico-syntactic patterns to express the relations
between A and B. However, the pattern generation algorithm and the scoring scheme
are different. In particular, they use only the words in the mid-fix between A and B for
extracting lexical patterns that might represent relations between A and B. On the other
hand, we use wildcards and an n-gram model which can precisely capture the relation
between A and B [1].

Bunescu and Mooney proposed an approach for overcoming the problem of bias due
to high frequency words as mentioned in the previous section [2]. However, their method
needs a large amount of text from Web documents to compute word frequencies. This
cannot be accomplished by using only snippets from a keyword-based Web search
engine's results.


3  Method 

To answer the query {(A, B), (C, ?)}, the proposed method first extracts lexical patterns
that represent relations between A and B. The lexical patterns are n-grams of the context
surrounding the pair (A, B) in a sentence. It then uses the keyword C along with these
patterns to query a Web search engine for the answer D, similar to [3]. To improve the
ranking of the results that are returned by the above procedure, we use the symmetry of
relational similarity and complementary rank.

Fig. 2. An example SAT analogy question

Fig. 3. Scoring candidates D retrieved for the query {(A, B), (C, ?)}


3.1  Relational  Search  on  the  Web 


Fig. 1 shows the process to find the answer for the query {(A, B), (C, ?)}. First, we
extract the semantic relation between A and B by issuing queries of type "A * * * B"
to a Web search engine (see footnote 1) to obtain some text snippets that include A and B
separated by up to three words. Here, * denotes a wildcard for any word. To increase the similarity
between two pairs that have similar contexts, we generate all n-grams (n ≤ 5) which
contain both words of a word pair as lexical patterns for the pair. For instance, in
the sentence "big A such as B is considered to be ...", we generate sequences such as
"big A such as B", "A such as B" and "A such as B is". We obtain the lexical patterns
by replacing A with the variable α and B with β in the original sub-sequences: "big α
such as β", "α such as β" and "α such as β is". To avoid noisy patterns, we ignore all
patterns whose frequencies are smaller than a frequency threshold ξ. We denote the set
of these patterns by P.
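
The pattern generation step can be sketched as follows. This is a minimal illustration, assuming that A and B are single whitespace-delimited tokens and that snippets are already available as strings; the function names and the `threshold` parameter name are hypothetical, and no call to a real search API is made.

```python
from collections import Counter

def lexical_patterns(tokens, a, b, max_n=5):
    """All n-grams (n <= max_n) that contain both a and b,
    with a replaced by the variable 'α' and b by 'β'."""
    patterns = []
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            gram = tokens[start:start + n]
            if a in gram and b in gram:
                patterns.append(
                    " ".join("α" if t == a else "β" if t == b else t for t in gram)
                )
    return patterns

def frequent_patterns(snippets, a, b, threshold):
    """Pattern set P: drop patterns whose frequency is smaller than the threshold."""
    counts = Counter()
    for snippet in snippets:
        counts.update(lexical_patterns(snippet.split(), a, b))
    return {p: c for p, c in counts.items() if c >= threshold}

sentence = "big A such as B is considered to be"
patterns = lexical_patterns(sentence.split(), "A", "B")
print(patterns)
```

For the example sentence from the text, the sketch yields exactly the three patterns listed above.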

To get candidate answers, for each pattern p ∈ P we input the query "p[C/α, */β]"
(including the double quotes) to the search engine. The formula p[C/α] represents the
substitution of α by C in the pattern p. For this query, the search engine returns snippets
which include C and other words in the pattern p and some extra words in this order.
For example, for the query "lion is a large *", the search engine returns snippets such
as "lion is a large cat ..." or "lion is a large four-legged animal ...". Because we want
to get the word at the position of the wildcard * in the query, we add those extra
words into the candidate answer set 𝒟. We then rank a candidate D ∈ 𝒟 using the
following ranking score:


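\mathrm{score}_{init}(D) = \frac{\sum_{p \in P_D} \mathrm{freq}(\text{``}p[C/\alpha, D/\beta]\text{''})}{\mathrm{freq}(\text{``}C * * * D\text{''})}    (1)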


In Formula 1, P_D are the patterns that appeared with D, and freq("p[C/α, D/β]") is the
frequency of co-occurrences of the word D with the word C and other words in the
patterns. Because the number of words between C and D is less than three, we normalize
the sum of freq("p[C/α, D/β]") by dividing it by the hit count of the query "C
* * * D". Finally, we assign a rank to each D ∈ 𝒟 using the score in Formula 1. We call
this ranking the initial ranking. The ranking score score_init(D) is called the initial
ranking score.

1 Yahoo Boss API http://developer.yahoo.com/search/boss/

598  T. Goto et al.
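
The initial ranking score can be illustrated with a short sketch. All counts below are invented for the example and do not come from the paper; the function name `initial_score` and the convention of returning 0 when the query "C * * * D" has no hits are likewise assumptions.

```python
def initial_score(pattern_hits, pair_hits):
    """score_init(D): the summed frequencies of the instantiated patterns
    p[C/α, D/β], divided by the hit count of the query "C * * * D"."""
    if pair_hits == 0:
        return 0.0  # assumption: a candidate that never co-occurs with C scores 0
    return sum(pattern_hits.values()) / pair_hits

# Hypothetical counts for C = lion (numbers are made up for illustration).
candidates = {
    "cat": ({"lion is a large cat": 40, "lion such as cat": 10}, 200),
    "animal": ({"lion is a large animal": 30}, 600),
}

scores = {d: initial_score(hits, pair) for d, (hits, pair) in candidates.items()}

# Initial ranking: candidates ordered by decreasing score_init.
initial_ranking = sorted(scores, key=scores.get, reverse=True)
```

With these made-up counts, the frequent but weakly related candidate "animal" is pushed down by the normalisation, although it still depends on co-occurrence with C alone.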


3.2  Symmetry  in  Relational  Similarity 


In the initial ranking, a candidate D might receive a top rank merely because it
frequently occurs with C irrespective of the relation between A and B. To solve this
problem, we propose a ranking score using the symmetry in relational similarity. Let us
denote the relational similarity between (A, B) and (C, D) by R((A, B), (C, D)).
Relational similarity will remain unchanged under certain permutations of the four words
(e.g., R((A, B), (C, D)) = R((B, A), (D, C))). Therefore, the candidates that are
ranked at the top by one form of the query (e.g., {(A, B), (C, ?)}) must also be ranked at
the top by the other (alternative) forms of the query (e.g., {(B, A), (?, C)}). In other words,
if D is an incorrect candidate, then it will be ranked at the top only in a small number
of alternative forms of the query and it will receive bad ranks in almost all alternative
forms. To consider the symmetric property, we define the score of D as follows:


\mathrm{score}(D) = \frac{\mathrm{score}_{comp}(D) + \mathrm{score}_{compR}(D)}{2}    (2)


In the above formula, score_comp(D) is the score of D in the query {(A, B), (C, ?)} when we
take into account the complementary rank (we will explain complementary rank in the
next section). Similarly, score_compR(D) is the score of D in the other forms of the query
whose similarities are invariant to a symmetric transformation (e.g., {(B, A), (?, C)}).

In addition to symmetry, we use the complementary rank of C or D to rank candidate
answers in a relational search engine. The complementary rank of a candidate D in the
query {(A, B), (C, ?)} is the initial rank of C in the query {(A, B), (?, D)}, and vice
versa. We define the score of D by using complementary rank as follows:

$$\mathrm{score}_{\mathrm{comp}}(D) = \frac{\mathrm{rank}_{\mathrm{init}}(D) + \mathrm{rank}_{AB?D}(C) + \mathrm{rank}_{BAD?}(C)}{3}, \tag{3}$$

where $\mathrm{rank}_{\mathrm{init}}(D)$ is the rank of D in the initial ranking (i.e., ranking by $\mathrm{score}_{\mathrm{init}}(D)$
as shown in Formula 1), $\mathrm{rank}_{AB?D}(C)$ is the initial rank of C in {(A, B), (?, D)}, and
$\mathrm{rank}_{BAD?}(C)$ is the initial rank of C in {(B, A), (D, ?)}. We denote the score of D in
the initial ranking of {(A, B), (C, ?)} as $\mathrm{score}_{\mathrm{comp}}(D)$ and the score of D in the initial ranking
of {(B, A), (?, C)} as $\mathrm{score}_{\mathrm{compR}}(D)$. By combining Formulas 2 and 3, we obtain the
final score of D, score(D), for ranking candidates $D \in \mathcal{D}$.

We illustrate the process of calculating $\mathrm{score}_{\mathrm{comp}}(D)$ for the query {(A, B), (C, ?)}
in Figure 3. We assign D a high rank if C is assigned high ranks when we use the queries
{(A, B), (?, D)} and {(B, A), (D, ?)}.
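The combination of Formulas 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names and the example ranks are hypothetical, and dividing the three ranks by 3 assumes that Formula 3 is a plain average. The sketch also assumes that the score for the reversed query form is built from a second triple of ranks in the same way. Because the scores are built from ranks, a smaller value indicates a better candidate.

```python
def comp_score(rank_init, rank_c_fwd, rank_c_rev):
    """Formula 3, assumed here to be a plain average of three ranks."""
    return (rank_init + rank_c_fwd + rank_c_rev) / 3.0


def final_score(ranks_original, ranks_reversed):
    """Formula 2: average the scores of the two symmetric query forms."""
    return (comp_score(*ranks_original) + comp_score(*ranks_reversed)) / 2.0


def rerank(candidates):
    """Sort candidate words by final score; lower rank-based scores come first."""
    return sorted(candidates, key=lambda word: final_score(*candidates[word]))


# Hypothetical ranks: "the" tops the initial ranking only because it co-occurs
# often with C, but C is ranked poorly in the complementary queries.
candidates = {
    "cat": ((2, 1, 1), (1, 2, 1)),
    "the": ((1, 40, 35), (30, 50, 45)),
}
print(rerank(candidates))
```

In this toy input the noisy candidate has the best initial rank but falls behind once the complementary ranks are averaged in.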


4  Evaluation 

4.1  Experiments 

To evaluate the proposed ranking algorithm, we use the SAT dataset [1,5]. The SAT
dataset contains 374 word analogy questions selected from the Scholastic Aptitude Test.
Each question has a question word pair (stem pair) and five choices for answer word
pairs, in which the correct pair has the highest similarity with the stem pair as shown in
Fig. 2. Therefore, we use the following method for solving SAT analogy questions.


Calculating the score of a word in the search result set

Given a stem word pair (A, B) and a choice word pair (C, D) (e.g., A is ostrich, B is bird,
C is lion and D is cat), we first perform the query {(A, B), (C, ?)} to obtain a candidate
answer set $\mathcal{D}$. Using Formula 1, we rank the set $\mathcal{D}$ to get the initial ranking. Suppose
that the rank of D in this ranking is $N^D$. Next, we perform the query {(A, B), (?, D)}
to obtain a candidate set and record the rank $N^C_1$ of C in this set (according to the
score in Formula 1). Similarly, we use the query {(B, A), (D, ?)} to get the rank of C as $N^C_2$.
Finally, we define the SAT candidate score of D using the following formula:


$$\mathrm{SATSubScore}(D) = \frac{N^D + N^C_1 + N^C_2}{3} \tag{4}$$


Score  of  a  SAT  candidate  answer 

We calculate the score of a SAT candidate word pair c = (C, D) as follows:

$$\mathrm{SATScore}(c) = \frac{\mathrm{SATSubScore}(C) + \mathrm{SATSubScore}(D)}{2} \tag{5}$$


After calculating SATScore for each candidate SAT answer, we select the choice whose
score is minimal as the answer to the SAT question. To evaluate the performance, we
compare the answer that our system outputs with the correct answer.
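The selection step can be sketched as follows. The sub-scores are taken as given inputs, and the choice labels and numbers are hypothetical; only the averaging of Formula 5 and the "minimal score wins" rule come from the text.

```python
def sat_score(sub_score_c, sub_score_d):
    """Formula 5: average of the two word-level sub-scores of a choice pair."""
    return (sub_score_c + sub_score_d) / 2.0


def choose_answer(choices):
    """choices maps a label to (SATSubScore(C), SATSubScore(D)).

    The choice with the minimal SATScore is returned, since the
    sub-scores are rank-based and lower is better.
    """
    return min(choices, key=lambda label: sat_score(*choices[label]))


# Hypothetical sub-scores for three of the five choices of one question.
choices = {"a": (5.0, 7.0), "b": (2.0, 3.0), "c": (4.0, 1.5)}
print(choose_answer(choices))
```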


4.2  Results 


We obtain 105 correct answers before using symmetry and complementary rank.
After using symmetry and complementary rank, we get 114 correct answers. Table 1
shows the experimental results. When we fail to retrieve the word C or D for all five
choices, we cannot use the queries {(A, B), (C, ?)} or {(A, B), (?, D)} respectively. In such
cases, we cannot estimate the effect of our method, so we also measure the performance
when we ignore those cases. After eliminating such cases, only 213 questions remain.
For those questions, the proposed method achieved an accuracy of 46.9% when using
symmetry, whereas the initial ranking achieves only 43.0%.

To measure the effect of our method, we consider questions for which we retrieve the correct
answer and at least one other answer candidate that includes C or D. This results in 216 questions,
for which we obtained 78 correct answers (36.1%) before utilizing symmetry and complementary
rank and 87 correct answers (40.3%) after. Therefore, by using symmetry and
complementary rank, we obtained a 4.2% improvement in the SAT result.
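The reported rates can be checked with a few lines of arithmetic. The sketch below assumes a total of 374 SAT questions, which is the count consistent with 105 correct answers giving 28.1%; the helper name `pct` is hypothetical.

```python
def pct(correct, total):
    """Percentage of correct answers, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


recall_before = pct(105, 374)   # all questions, initial ranking
recall_after = pct(114, 374)    # all questions, symmetry + complementary rank
subset_before = pct(78, 216)    # questions with the correct choice and one other retrieved
subset_after = pct(87, 216)

print(recall_before, recall_after, subset_before, subset_after)
print(round(subset_after - subset_before, 1))
```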


Table 1.  Comparison of correct rates

Criterion                                                    Initial ranking   Using symmetry and
                                                                               complementary rank
# correct / # questions (recall)                             28.1%             30.5%
# correct / # questions for which we can get C or D
  (precision)                                                43.0%             46.9%
# correct / # questions for which we can retrieve the
  correct choice and at least one other choice               36.1%             40.3%

5  Discussion

We observe that the use of symmetry and complementary rank improves the initial
ranking. This shows that the proposed ranking method can be effectively applied to
rank relational search results. In particular, the proposed method of exploiting the symmetry
of relations can be combined with advanced lexical pattern extraction techniques (e.g.,
the PrefixSpan algorithm) to drastically improve the precision of relational search.
Furthermore, one can improve the precision by combining an existing relational search
scoring algorithm such as [3] with the proposed scoring algorithm. Therefore, the proposed
method can be smoothly integrated with other existing methods for ranking relational
search results. The integration is easy because the proposed method exploits
a special aspect of relations (i.e., the symmetry of relations) that is not utilized in
existing approaches. It is worth noting that relational search is the first task concerning
relational similarity in which complementary rank can be exploited, and hence the task in
which it was invented. In other tasks such as similarity measuring [1,5], complementary rank does not
appear because in those tasks the four words in the two pairs (A, B) and (C, D) are all
given. On the other hand, in relational search or in tasks in which one or more words are
not given, we can define complementary rank to represent the strength of the relation
between the candidate word and the input query word.

It is worth noting that the evaluation using the SAT benchmark gives an interesting
criterion for evaluating the performance of a relational search engine, which cannot be easily
evaluated using normal criteria such as F-score or MRR (mean reciprocal rank).

6  Conclusion 

We implemented relational search by using a web search engine and proposed a
ranking method for relational search. There are some noisy candidate words in the initial
ranking of relational search results. To eliminate noisy candidate words from the
initial ranking, we used a symmetric property and complementary rank. By using these
features, we improved precision by 4.2%. This shows that our proposed method
of using the symmetric property is effective for improving the correct rate on the SAT dataset and
for ranking relational search results.

Abstract.  An interday financial trading system with a predictive model
empowered by a novel brain-inspired evolving Mamdani-Takagi-Sugeno
Neural-Fuzzy Inference System (eMTSFIS) is proposed in this paper. The
eMTSFIS predictive model possesses synaptic mechanisms and information
processing capabilities of the human hippocampus, resulting in a more robust
and adaptive forecasting model as compared to existing econometric and
neural-fuzzy techniques. The trading strategy of the proposed system is based
on the moving-averages-convergence/divergence (MACD) principle to generate
buy-sell trading signals. By introducing forecasting capabilities to the
computation of the MACD trend signals, the lagging nature of the MACD
trading rule is addressed. Experimental results based on the S&P500 Index
confirmed that eMTSFIS is able to provide highly accurate predictions and the
resultant system is able to identify timely trading opportunities while avoiding
unnecessary trading transactions. These attributes enable the eMTSFIS-based
trading system to yield higher multiplicative returns for an investor.

Keywords:  evolving neural-fuzzy inference system, Mamdani-Takagi-Sugeno
(MTS) fuzzy modeling, human hippocampus, time-series prediction, financial
trading system, moving-averages-convergence/divergence (MACD), S&P500.


1  Introduction 

The fundamental approach to financial trading is to identify movement trends and
turning points, and subsequently make a decision to enter or exit the financial market.
Generally, the investor will maintain an investment position until evidence indicates
that the trend has reversed, at which point another decision will be made to take advantage
of the trading opportunity that arises. Many investors rely on financial market analysis
techniques, which can be broadly categorized as fundamental analysis and technical
analysis, to formulate their trading decisions. Fundamental analysis focuses on the
study of economic forces that affect supply and demand, for the purpose of
forecasting the future price trends and deciding the long-term investment strategy [1].
In contrast, technical analysis bases its decision-making on historical financial data,
such as price and volume [2]. Many financial theoreticians doubt the possibility of

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 601-607, 2010.

© Springer-Verlag Berlin Heidelberg 2010


W.L.  Ho,  W.L.  Tung,  and  C.  Quek 


using technical analysis to predict the financial market on the basis of the “Efficient
Market Hypothesis” (EMH) [3], which implies that it is impossible to consistently
outperform the market by using any information that is already available to the
market. Despite the deeply entrenched beliefs of EMH, there has been substantial
evidence [4, 5] on the predictability of financial markets using technical analysis.

This paper proposes the use of a brain-inspired incremental neuro-fuzzy system
named the evolving Mamdani-Takagi-Sugeno neuro-fuzzy inference system
(eMTSFIS) [6] to predict the financial market and to investigate the profitability of
the derived trading system using historical data of the S&P500 market index.


2  eMTSFIS:  The  Evolving  Mamdani-Takagi-Sugeno 
Neural-Fuzzy  Inference  System 

The rule-generating procedure of the eMTSFIS model computationally mimics the
human hippocampus, which is capable of a neurogenesis process [7] that has been
regarded as the primary mechanism used to resolve the learning stability-plasticity
dilemma in the human brain [8], via a recall comparator and novelty detection
mechanism. Details of the rule-generating procedure of eMTSFIS are reported in [6].
In addition, the human hippocampus maintains its acquired knowledge using two
primary synaptic mechanisms: long-term potentiation (LTP) [9] and long-term
depression (LTD) [10]. LTP is responsible for the learning and reinforcement of
memory traces in the hippocampal formation. LTD, on the other hand, is the
mechanism for forgetting learnt information. Computationally, the eMTSFIS model
mimics these neural mechanisms via the use of fuzzy rule potentials with the LTP and
LTD concepts to construct a set of evolving IF-THEN Mamdani fuzzy rules to model
non-stationary data generating processes [6].

The use of eMTSFIS to model financial trends allows a human investor to examine
the inherent trend information extracted from the historical observations via highly
interpretable fuzzy rules. Moreover, eMTSFIS can mitigate the effects of noise
artifacts on the computed price predictions as it employs Gaussian-shaped fuzzy sets
to model (generalize) the characteristics of the past price movements. As such, a
human trader will be able to develop a better understanding of the underlying
characteristics of the observed price movements and make better-informed trading
decisions to maximize investment profits.


3  Financial  Trading  System  Using  Real-World  Data 

In this paper, a financial trading system with no predictive model and a financial
trading system with eMTSFIS as the predictive model are introduced. In both
systems, the trading signal at time t is represented by F(t), where $F(t) \in \{1, -1\}$, with 1
and -1 representing the buy and sell signals respectively. The performances of the
various trading strategies studied in this paper are defined by the portfolio terminal
value R(T), which measures the wealth created by the respective trading strategies using the
notion of multiplicative returns [11] as shown in equation (1).

$$R(T) = R(0) \prod_{t=1}^{T} \{1 + F(t-1)\, r(t)\}\,\{1 - \delta\, |F(t) - F(t-1)|\} \tag{1}$$

where $r(t) = (y(t) / y(t-1)) - 1$; the prices of the security being traded at times t and
(t-1) are denoted as y(t) and y(t-1) respectively; F(t) is the action from a trading system
at time t and is defined using equation (2) or (3); and $\delta$ is the transaction cost and is
assumed to be a fraction of the transacted price value.
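The portfolio terminal value can be computed with a short loop. The sketch below assumes the product form of multiplicative returns, i.e. R(T) equals R(0) times the product over t of {1 + F(t-1) r(t)}{1 - δ|F(t) - F(t-1)|}; the function name and the toy price series are hypothetical.

```python
def terminal_value(prices, signals, delta=0.002, r0=1.0):
    """Portfolio terminal value R(T) under multiplicative returns.

    prices[t]  : price y(t) of the traded security
    signals[t] : trading signal F(t), either 1 (buy) or -1 (sell)
    delta      : transaction cost rate (0.2% here)
    r0         : initial portfolio value R(0)
    """
    value = r0
    for t in range(1, len(prices)):
        r_t = prices[t] / prices[t - 1] - 1.0          # simple return r(t)
        gain = 1.0 + signals[t - 1] * r_t              # position held over (t-1, t]
        cost = 1.0 - delta * abs(signals[t] - signals[t - 1])
        value *= gain * cost
    return value


# Toy series: long for one step, then switch to short before a price drop.
print(terminal_value([100.0, 110.0, 99.0], [1, -1, -1]))
```

A switch from 1 to -1 changes the signal by 2, so in this formulation a reversal costs twice the rate δ.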

The  financial  trading  system  without  a  predictive  model  (TS-WOP)  is  shown  in 
Figure  2.  In  this  system,  the  trading  signal  F(t)  is  derived  using  the  MACD  trading 
rule  [2]  of  equation  (2): 


Fig.  2.  The  financial  trading  system  using  MACD  with  no  predictive  model 


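$$F(t) = \begin{cases} 1, & \mathrm{MACD}_{12,26} \text{ signal crosses above the } \mathrm{EMA}_9 \text{ trigger line} \\ -1, & \mathrm{MACD}_{12,26} \text{ signal falls below the } \mathrm{EMA}_9 \text{ trigger line} \\ F(t-1), & \text{otherwise} \end{cases} \tag{2}$$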


where $\mathrm{MACD}_{12,26}$ denotes the MACD employing the 12-day and 26-day exponential
moving averages (EMA) [2] as the fast and slow signals respectively; and $\mathrm{EMA}_9$
denotes the 9-day EMA of the MACD that is used in place of the zero reference line as
the trigger to generate the buy-sell trading signals.
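The MACD trading rule of equation (2) can be sketched as follows. Several details are assumptions rather than part of the paper: the EMA uses the common smoothing factor 2/(span + 1) and is seeded with the first value, the signal before any crossover is taken to be -1, and the function names are hypothetical.

```python
def ema(values, span):
    """Exponential moving average, seeded with the first value (assumed convention)."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out


def macd_signals(prices, fast=12, slow=26, trigger=9, initial=-1):
    """Trading signal F(t): 1 on an upward crossover, -1 on a downward one, else F(t-1)."""
    macd = [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
    trig = ema(macd, trigger)
    signals = [initial]
    for t in range(1, len(prices)):
        prev_gap = macd[t - 1] - trig[t - 1]
        gap = macd[t] - trig[t]
        if prev_gap <= 0 < gap:
            signals.append(1)            # MACD crosses above the trigger line
        elif prev_gap >= 0 > gap:
            signals.append(-1)           # MACD falls below the trigger line
        else:
            signals.append(signals[-1])  # hold the previous signal
    return signals


# Toy price path: a steady rise followed by a steady fall.
prices = [float(p) for p in list(range(1, 61)) + list(range(59, 0, -1))]
signals = macd_signals(prices)
print(signals[1], signals[-1])
```

Because both averages lag the price, the sell signal on this toy path appears only some steps after the peak, which is the lagging behaviour the predictive variant aims to reduce.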

The proposed financial trading system with eMTSFIS as a predictive model is shown
in Figure 3. This system seeks to address the lagging nature of the MACD trading rule
and to enhance its timeliness in spotting trading opportunities by introducing forecasting
capabilities to the computation of the underlying trend signals. The three most recent
daily closing prices [i.e., Closing(t-2), Closing(t-1), Closing(t)] are used as inputs for the
eMTSFIS predictive model to forecast the future closing price V days later [i.e.,
Closing'(t+V)]. The eMTSFIS predictive model is trained using supervised learning [6]
on a set of historical S&P500 daily closing price training samples. The trained
eMTSFIS is then employed to predict a set of out-of-sample closing levels. All the
predicted closing prices are then fed into the trading model, which computes the
predicted MACD' and generates the trading signal F(t) using equation (3).


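$$F(t) = \begin{cases} 1, & \mathrm{MACD}'_{12,26}(t+V) \text{ signal crosses above the } \mathrm{EMA}_9 \text{ trigger line} \\ -1, & \mathrm{MACD}'_{12,26}(t+V) \text{ signal falls below the } \mathrm{EMA}_9 \text{ trigger line} \\ F(t-1), & \text{otherwise} \end{cases} \tag{3}$$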


Fig. 3.  The financial trading system with the eMTSFIS predictive model

To demonstrate the forecasting capabilities of the proposed eMTSFIS model, the
prediction results of eMTSFIS are benchmarked against two well-established
evolving neural-fuzzy systems, i.e. EFuNN [12] and DENFIS [13], as well as two
econometric forecasting models, i.e. the autoregressive moving average (ARMA) [14]
model and the Random Walk model [15]. Finally, the performance of the proposed
eMTSFIS-based financial trading system shown in Figure 3 is benchmarked against
those of a simple buy-and-hold strategy, a trading system with no prediction, a trading
system with perfect prediction and two trading systems with EFuNN and DENFIS as
the respective predictive model using historical data of the S&P500 market index.

All the aforementioned predictive models are constructed as 3-input-1-output
systems configured with default parameters. For the trading systems employing a
predictive model, the trading signals are generated using equation (3). Correspondingly,
the trade signals for the trading system employing perfect predictions are also generated
using equation (3), but the predicted Closing'(t+V) prices are now replaced with the
actual Closing(t+V) prices. The final portfolio value of each benchmarked trading
system is computed using equation (1), with the initial portfolio value R(0) = 1.0 and the
transaction cost rate $\delta$ = 0.2%.


3.1  Forecasting  and  Trading  of  the  S&P500  Market  Index 

In  this  experiment,  the  benchmarked  trading  systems  are  evaluated  using  the  S&P500 
market  index.  The  experimental  data  is  obtained  from  Yahoo  Finance  and  consists  of 
15637 daily closing values spanning the period of 05 January 1950 to 11 December
2009.  The  training  data  set  for  the  various  predictive  models  consists  of  the  initial 
7500  index  values  while  the  out-of-sample  data  set  contains  the  remaining  8137  index 
values.  The  three  most  recent  daily  closing  index  values  are  given  as  inputs  to  the 
various  predictive  models  to  forecast  the  closing  index  value  five  days  later. 
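The construction of the 3-input-1-output samples and the train/out-of-sample split can be sketched as follows. The function name `make_samples` is hypothetical, and how the original experiments handled the samples at the boundary between the two sets is not stated, so the split below simply cuts the raw series at index 7500.

```python
def make_samples(closes, n_inputs=3, horizon=5):
    """Pair the n_inputs most recent closes with the close `horizon` days later."""
    samples = []
    for t in range(n_inputs - 1, len(closes) - horizon):
        inputs = tuple(closes[t - n_inputs + 1 : t + 1])   # Closing(t-2), Closing(t-1), Closing(t)
        target = closes[t + horizon]                        # Closing(t+5)
        samples.append((inputs, target))
    return samples


# Toy series standing in for the index values.
toy = list(range(10))
print(make_samples(toy))

# Split sizes described in the text: 7500 training values, the rest out-of-sample.
series = list(range(15637))
train, test = series[:7500], series[7500:]
print(len(train), len(test))
```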

As observed from Table 1, the eMTSFIS predictive model has superior forecasting
performance as compared to the econometric models (i.e. ARMA and Random Walk)
and the evolving neural-fuzzy systems DENFIS and EFuNN. According to [16],
ARMA and Random Walk have poor forecasting results because of their linear
structures and other inherent limitations. On the other hand, despite having a larger
rule-base as compared to DENFIS, the Mamdani-type fuzzy rules identified by
eMTSFIS are highly interpretable. This contrasts favourably with the TSK-type fuzzy
rules in DENFIS, which are difficult to comprehend.


Table 1.  Forecasting results of different predictive models on the S&P500 index

Predictive Model    Test Error (RMSE)    Number of Rules
ARMA (1,1)          10.369               N.A.
Random Walk         9.9203               N.A.
DENFIS              0.7203               6
EFuNN               0.7313               213
eMTSFIS             0.3901               38

Table 2 shows the overall performances of the benchmarked trading systems as
reflected by their portfolio end value R(T) and the square of the Pearson correlation
(SPC) between the actual and predicted index series. In Table 2, B&H denotes the
buy-and-hold strategy; TS-WOP and TS-PP refer to the trading systems with no
prediction and with perfect predictions respectively; TS-DENFIS, TS-EFuNN and
TS-eMTSFIS are the respective trading systems employing DENFIS, EFuNN and
eMTSFIS as the predictive model.


Table 2.  Performances of the different trading systems using the S&P500 index

Trading System    R(T)     SPC
B&H               10.40    N.A.
TS-WOP            11.50    N.A.
TS-DENFIS         13.66    0.9763
TS-EFuNN          12.48    0.9044
TS-eMTSFIS        15.31    0.9958
TS-PP             59.57    1.0

As shown by the multiplicative returns generated for an investor employing the
various trading strategies in Table 2, the trading system with the proposed eMTSFIS
as a predictive model (TS-eMTSFIS) outperformed the simple buy-and-hold strategy,
the trading system with no prediction and the trading systems employing DENFIS and
EFuNN as predictive models. The superior performance of TS-eMTSFIS can be
analyzed by inspecting its trading signals as shown in Figure 4. Based on region (a) of
Figure 4, TS-eMTSFIS is able to enter into a long (buy) position at a lower price and
at an earlier time than the trading system with no prediction (TS-WOP) due to an
accurate forecast by the eMTSFIS model. Similarly, in region (c) of Figure 4,
TS-eMTSFIS is able to secure a short (sell) position at a higher price and at an earlier
time than TS-WOP. These well-timed trades translate to a higher multiplicative return
R(T) as compared to the other trading strategies shown in Table 2. In addition, the closing
index values predicted by eMTSFIS have a higher correlation to the actual closing
levels than those predicted by DENFIS and EFuNN. This translates to improved
decision making and enhances the timeliness of the trading system TS-eMTSFIS in
spotting trading opportunities, thus contributing to a higher multiplicative return R(T).
Moreover, region (b) of Figure 4 shows that TS-eMTSFIS is able to avoid some
unnecessary trading transactions, thus reducing the transaction costs incurred. This
can be attributed to the ability of eMTSFIS to generalize the characteristics of the past
index movements, thus mitigating the effects of noise artifacts on the computed
MACD series that determines the corresponding trading signals.


[Figure: trading signals of TS-WOP and TS-eMTSFIS plotted against time; vertical axis: Trading Signal]

Fig. 4.  Trading signals on the S&P500 index from time t = 11200 to 12400

4  Conclusion 

A  financial  trading  system  employing  a  novel  brain-inspired  evolving  Mamdani- 
Takagi-Sugeno  neuro-fuzzy  inference  system  (eMTSFIS)  predictive  model  is 
proposed.  In  this  paper,  eMTSFIS  is  used  to  model  and  forecast  the  daily  closing 
index  values  of  the  S&P500  market  index.  Experimental  results  confirmed  that 
eMTSFIS  is  able  to  provide  highly  accurate  predictions  and  identify  timely  trading 
opportunities  while  avoiding  unnecessary  trading  transactions.  Collectively,  these 
attributes  enable  the  proposed  eMTSFIS-based  trading  system  to  yield  higher 
multiplicative returns for an investor.

Abstract.  This paper studies the problem of multi-agent planning in
an environment where agents may need to cooperate in order to achieve
their individual goals but do so only if the cooperation is beneficial
to each of them. We assume that each agent has a reward function
and a cost function that determine the agent's utility over all possible
plans. The agents negotiate to form a joint plan through a procedure of
alternating offers of joint plans and side-payments. We propose an
algorithm that generates an agreement for any given planning problem and
show that this agreement maximizes the gross utility and minimizes the
distance to the ideal utility point.

Keywords:  multi-agent  planning,  joint  plan,  side-payment,  bargaining. 


1  Introduction 

Multiagent planning has been an emerging research topic in recent years in
the area of Artificial Intelligence [1,2,3,4,5,7]. Most existing studies on
multiagent planning involve planning for common goals, plan coordination, plan
merging and synchronized planning. Most of the existing frameworks for multiagent
planning are based either on the assumption that all agents have common goals
and will therefore be fully cooperative in forming a joint plan, or on the assumption that
all agents must reveal their private information, such as goals, rewards, costs
and/or utilities, to other agents or arbitrators. In many real-world situations,
neither assumption holds. It is a great challenge to find a joint plan for
a multiagent system in which all agents are self-interested, with individual goals
and private information.

In this paper, we propose a solution to multiagent planning based on the
following scenario:

—  Each agent in the system has its own goals, reward of goal achievement and
costs of actions.

—  All agents are self-interested but profit-driven. An agent is concerned only with
its own goals. However, to attract other agents to join its plan, an agent
may offer the other agents some payment (named side-payment) if the other
agents agree on the joint plan.

—  An agent can make a proposal of a plan with actions from the other agents
or its own (therefore a joint plan) and a side-payment scheme. An agent
can accept another agent's proposal if the net profit it receives from this plan
(possible reward minus costs plus side-payment) surpasses that of any of its own
plans, or reject the proposal by making a counter proposal.

This research was supported by the Australian Research Council through Linkage
Project LP0777015.

Based on the above scenario, we propose a planning procedure, named Planning
Procedure based on Bargaining (PPB). The procedure is based on an alternating-offer
model of bargaining for two-agent bargaining situations [8]. The planning
procedure proceeds in several rounds. In each round, only one agent can make
a proposal, which consists of a plan and a side payment scheme. If the other
agent accepts the proposal, the procedure terminates and the current proposal
becomes the final agreement; otherwise, it is the other agent’s turn to make a
proposal. We show that PPB is correct, complete, and terminating.

This paper is structured as follows. Firstly, we introduce some formal
preliminaries to represent the planning problems. Secondly, we define the concepts
of plan proposals and bargaining mechanism. Thirdly, we propose a planning
procedure based on the bargaining mechanism and show its properties. Finally,
we discuss related work and future research directions.


2  Planning  Domains  and  Problems 

In this section we present a model of dynamic systems, based on which the
planning problems dealt with in this paper are described.

A multi-agent planning domain is a tuple $\Sigma = (S, s_0, \Phi, A, T)$, where $S$ is a
set of states, $s_0 \in S$ is the initial state, $\Phi$ is a non-empty set of agents, $A$ is a
set of actions, and $T \subseteq S \times \Phi \times A \times S$ represents the state transition relation.
$(s, \varphi, a, s') \in T$ means that $\varphi$ can perform action $a$ at state $s$ and bring about
$s'$ as one of the possible result states.

For simplicity, we assume in this paper that $|\{s' \in S : (s, \varphi, a, s') \in T\}| \le 1$
for each $(s, \varphi, a)$ in $S \times \Phi \times A$, i.e., we only consider deterministic state transitions.
All actions are assumed to be asynchronous, that is to say, at most one agent
performs an action at each state.

Definition 1.  Given a planning domain $\Sigma$, a plan $\pi$ for $\Sigma$ is a finite sequence
of the form $\langle (\varphi_1, a_1); (\varphi_2, a_2); \cdots; (\varphi_n, a_n) \rangle$, where $\varphi_i \in \Phi$ and $a_i \in A$. The
plan $\pi$ is said to be applicable to $\Sigma$ if there exist $s_1, s_2, \ldots, s_n \in S$ such
that $(s_{i-1}, \varphi_i, a_i, s_i) \in T$ for all $0 < i \le n$. $s_n$ and $n$ are referred to as the
last state and the length of the plan, denoted by LState($\pi$) and Length($\pi$),
respectively. Agts($\pi$) denotes the set of agents that are involved in $\pi$, i.e.,
Agts($\pi$) = $\{\varphi : \varphi \text{ appears in } \pi\}$.
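Definition 1 can be illustrated with a small sketch. It assumes deterministic transitions stored in a dictionary keyed by (state, agent, action); the toy domain and all names are hypothetical.

```python
def last_state(transitions, s0, plan):
    """Return LState(plan) if the plan is applicable from s0, otherwise None.

    transitions maps (state, agent, action) to the unique successor state.
    plan is a sequence of (agent, action) pairs.
    """
    state = s0
    for agent, action in plan:
        key = (state, agent, action)
        if key not in transitions:
            return None          # no matching transition: plan is not applicable
        state = transitions[key]
    return state


def agents_of(plan):
    """Agts(plan): the set of agents that appear in the plan."""
    return {agent for agent, _ in plan}


# Hypothetical two-agent domain.
transitions = {
    ("s0", "agent1", "open"): "s1",
    ("s1", "agent2", "push"): "s2",
}
joint_plan = [("agent1", "open"), ("agent2", "push")]
print(last_state(transitions, "s0", joint_plan), len(joint_plan), agents_of(joint_plan))
```

In this encoding the empty plan is applicable and its last state is the initial state.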

Given a planning domain, assume that each agent has its own goals, rewards if
the goals are achieved, and costs of actions. A multi-agent planning problem is to
find a joint plan that can achieve the goals of all the agents while maximizing
their rewards and minimizing their costs of actions.

Definition 2.  A planning problem is a tuple $\mathcal{P} = (\Sigma, G, r, c)$, where

—  $\Sigma = (S, s_0, \Phi, A, T)$ is a planning domain.

—  $G : \Phi \to 2^S$ is a goal function that specifies each agent's goal states.

—  $r : \Phi \to \mathbb{Z}^+$ is a reward function that assigns to each agent a non-negative
integer, representing the reward an agent can receive if its goals are achieved.

—  $c : \Phi \times A \to \mathbb{Z}^+$ is a cost function that specifies the cost of each action to
each agent.

Note that for every agent $\varphi$, $G(\varphi)$, $r(\varphi)$, and $c_\varphi(a) = c(\varphi, a)$ are $\varphi$'s private
information. Therefore we write $\varphi.G$, $\varphi.r$, and $\varphi.c$ instead of $G(\varphi)$, $r(\varphi)$, and $c_\varphi$,
respectively, to indicate that these functions are implemented in agent $\varphi$.

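Given a planning problem $\mathcal{P}$, let $\Omega(\mathcal{P})$ denote the set of all the applicable plans
for the planning domain of $\mathcal{P}$. For each agent $\varphi \in \Phi$ and $\pi = \langle (\varphi_1, a_1); (\varphi_2, a_2); \cdots \rangle
\in \Omega(\mathcal{P})$, we define $\varphi$'s utility of $\pi$ as follows:

$$u_\varphi(\pi) = \mathrm{Rew}_\varphi(\pi) - \sum_{i=1}^{\mathrm{Length}(\pi)} \mathrm{Cost}_\varphi(\varphi_i, a_i)$$

where $\mathrm{Rew}_\varphi(\pi) = \varphi.r$ if $\mathrm{LState}(\pi) \in \varphi.G$, and 0 otherwise; and $\mathrm{Cost}_\varphi(\varphi_i, a_i) =
\varphi.c(a_i)$ if $\varphi = \varphi_i$, and 0 otherwise.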
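We use $u_\varphi^{\bot}$ to denote the maximal value of utility that $\varphi$ can achieve without
other agents' involvement, i.e., $u_\varphi^{\bot} = \max_{\pi \in \Omega(\mathcal{P})} \{u_\varphi(\pi) \mid \mathrm{Agts}(\pi) \subseteq \{\varphi\}\}$. $u_\varphi^{\bot}$ acts
as $\varphi$'s bottom line for bargaining. In other words, $\varphi$ is willing to cooperate with
other agents only if the cooperation can bring to $\varphi$ a utility value which is strictly
greater than $u_\varphi^{\bot}$ (individual rationality). Let $\Omega^{\bot}(\mathcal{P})$ be the set of plans which
are individually rational, i.e., $\Omega^{\bot}(\mathcal{P}) = \{\pi \in \Omega(\mathcal{P}) \mid (\forall \varphi \in \Phi)\; u_\varphi(\pi) > u_\varphi^{\bot}\}$.
Similarly, we use $u_\varphi^{\top}$ to denote the maximal utility the agent $\varphi$ can gain with
respect to the current planning situation provided all other agents are individually
rational, i.e., $u_\varphi^{\top} = \max_{\pi \in \Omega^{\bot}(\mathcal{P})} u_\varphi(\pi)$. Indeed, $u_\varphi^{\top}$ is the ideal outcome of $\varphi$.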


3  Bargaining  Situation 

To simplify the presentation of our approach, we will focus on two-agent planning
problems, i.e. $\Phi = \{\varphi_{-1}, \varphi_1\}$. We call the utility pair $(u_{\varphi_{-1}}^{\top}, u_{\varphi_1}^{\top})$ the ideal point,
denoted by IdP($\mathcal{P}$). For any $j \in \{-1, 1\}$ and $\{\pi', \pi\} \subseteq \Omega^{\bot}(\mathcal{P})$, if $u_{\varphi_j}(\pi') >
u_{\varphi_j}(\pi)$, then agent $\varphi_j$ will prefer $\pi'$ to $\pi$. If $\varphi_{-j}$ does not agree to perform $\pi'$,
then $\varphi_j$ can propose a side payment such that the amount proposed to $\varphi_{-j}$ is
not greater than $u_{\varphi_j}(\pi') - u_{\varphi_j}(\pi) - 1$. If this proposal does not work, then $\varphi_j$
must abandon $\pi'$ and consider $\pi$ instead.

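Definition 3. A proposal to $\mathcal{P}$ is a pair $p = (\pi, \xi)$ such that $\pi$ is a plan for
the planning domain of $\mathcal{P}$, and $\xi : \Phi \to \mathbb{Z}$ is a side payment function which satisfies
$\sum_{\varphi \in \Phi} \xi(\varphi) = 0$. For any $k \in \mathbb{Z}$, $\xi_k$ denotes the side payment function that
assigns $k$ to $\varphi_{1}$ and $-k$ to $\varphi_{-1}$. For each $\varphi \in \Phi$, the utility of $p$ to $\varphi$ is defined
as: $u_\varphi(p) = u_\varphi(\pi) + \xi(\varphi)$.

$\mathrm{Pro}(\mathcal{P})$ denotes the set of possible proposals. Proposal $p = (\pi, \xi_k) \in \mathrm{Pro}(\mathcal{P})$ if
and only if: (1) $\pi \in \Omega^{\perp}(\mathcal{P})$, and (2) $u_{\varphi_{-1}}(p) > u_{\varphi_{-1}}^{\perp}$ and $u_{\varphi_{1}}(p) > u_{\varphi_{1}}^{\perp}$.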

In order to reach an agreement (i.e., a proposal accepted by the two agents),
the agents can bargain with each other by making proposals one by one. Once
an agreement $p = (\pi, \xi)$ is reached, all the agents in $\mathrm{Agts}(\pi)$ will cooperate to
perform $\pi$, and the gross utility, i.e., $\sum_{\varphi \in \Phi} u_\varphi(\pi)$, will be redistributed among
$\Phi$ such that each agent $\varphi$'s real income is $u_\varphi(p)$. For a proposal $p$ to $\mathcal{P}$, we use

\[
\mathrm{Dis}(p) = \sqrt{\left(u_{\varphi_{-1}}^{\top} - u_{\varphi_{-1}}(p)\right)^2 + \left(u_{\varphi_{1}}^{\top} - u_{\varphi_{1}}(p)\right)^2}
\]

to denote the distance between $\mathrm{IdP}(\mathcal{P})$ and the utility pair derived from $p$. In
other words, $\mathrm{Dis}(p)$ describes the concessions made by the two agents to achieve $p$.
This leads to the notion of solution, which characterizes the Pareto optimal
proposals that entail minimal concessions.

Definition 4. Proposal $p$ is a solution to $\mathcal{P}$ if it satisfies the following three
conditions:

Individual rationality: $p \in \mathrm{Pro}(\mathcal{P})$;

Pareto optimality: there is no proposal $p' \in \mathrm{Pro}(\mathcal{P})$ such that $u_\varphi(p') > u_\varphi(p)$
for all $\varphi \in \Phi$;

Minimal concession: $\mathrm{Dis}(p) = \min\{\mathrm{Dis}(p') \mid p' \in \mathrm{Pro}(\mathcal{P})\}$.
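
As an illustration of Definition 4, the following sketch filters a small, explicitly enumerated set of candidate proposals. It is not the procedure of the paper: it assumes that every proposal is already summarized by its utility pair (side payments included), and the names `solutions`, `bottom`, and `ideal` are hypothetical.

```python
import math


def distance_to_ideal(ideal, utilities):
    """Dis(p): Euclidean distance between the ideal point and a utility pair."""
    return math.hypot(ideal[0] - utilities[0], ideal[1] - utilities[1])


def solutions(proposals, bottom, ideal):
    """Return the names of proposals that satisfy all three conditions.

    proposals: dict name -> (utility of agent -1, utility of agent 1)
    bottom:    pair of bottom-line utilities
    ideal:     pair of ideal utilities (the ideal point)
    """
    # Individual rationality: both agents strictly exceed their bottom line.
    rational = {name: u for name, u in proposals.items()
                if u[0] > bottom[0] and u[1] > bottom[1]}
    if not rational:
        return []

    # Pareto optimality: no rational proposal is strictly better for both agents.
    def dominated(u):
        return any(v[0] > u[0] and v[1] > u[1] for v in rational.values())

    # Minimal concession: distance to the ideal point is minimal among rational proposals.
    best = min(distance_to_ideal(ideal, u) for u in rational.values())
    return sorted(name for name, u in rational.items()
                  if not dominated(u)
                  and math.isclose(distance_to_ideal(ideal, u), best))


proposals = {"a": (6, 6), "b": (9, 2), "c": (0, 9), "d": (5, 5)}
result = solutions(proposals, bottom=(0, 0), ideal=(10, 10))
```

In this toy instance, "c" is not individually rational, "d" is dominated by "a", and "b" lies farther from the ideal point than "a".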

4  The  Bargaining  Mechanism 

In this section, we present a planning procedure based on bargaining, and show
its properties. The procedure is used for two-agent planning settings, in which
all utility functions and goals are private information and cannot be revealed.
The planning procedure based on bargaining (PPB) is defined as follows.

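step 1: Each agent $\varphi \in \Phi$ calculates the set of plans
    $bups_\varphi = \{\pi \in \Omega(\mathcal{P}) \mid (u_\varphi(\pi) > u_\varphi^{\perp}) \wedge (\mathrm{Length}(\pi) < \delta)\}$ (see footnote 1),
    and sends $bups_\varphi$ to an arbitrator $\varphi^{*}$.
step 2: $\varphi^{*}$ calculates $\Omega^{\perp}(\mathcal{P}) = bups_{\varphi_{-1}} \cap bups_{\varphi_{1}}$. If $\Omega^{\perp}(\mathcal{P}) = \emptyset$, then $\varphi^{*}$
    announces that the result of the procedure is failure, and the procedure stops.
    Otherwise, $\varphi^{*}$ sets the set of plans to be considered $ps(0) := \Omega^{\perp}(\mathcal{P})$ and
    $i := \mathrm{Rand}(\{-1, 1\})$ (see footnote 2), and sends $ps(0)$ and $i$ to each $\varphi \in \Phi$.
step 3: Each $\varphi_j \in \Phi$ sets its proposal being considered
    $p_{\varphi_j}(0) := (\mathrm{Rand}(pls_{\varphi_j}), \xi_0)$, where

    $pls_{\varphi_j} = \arg\max_{\pi \in ps(0)} u_{\varphi_j}(\pi)$,

    and sends $p_{\varphi_j}(0)$ to $\varphi_{-j}$. Let $t := 0$, $\theta_{\varphi_{-1}} := 0$, and $\theta_{\varphi_{1}} := 0$.
step 4: If $u_{\varphi_i}(p_{\varphi_{-i}}(t)) > u_{\varphi_i}(p_{\varphi_i}(t))$, then $\varphi_i$ sends done to $\varphi^{*}$; goto step 7.
    Otherwise $\varphi_i$ sets $ps(t+1) := \{\pi \in ps(t) \mid u_{\varphi_i}(\pi) > u_{\varphi_i}(p_{\varphi_{-i}}(t))\}$, and
    sets $p_{\varphi_{-i}}(t+1) := p_{\varphi_{-i}}(t)$.
step 5: Suppose $p_{\varphi_i}(t) = (\pi, \xi)$. If $ps(t+1) = \emptyset$ or
    $u_{\varphi_i}(p_{\varphi_i}(t)) > \max\{u_{\varphi_i}(\pi') \mid \pi' \in ps(t+1)\}$, then $\theta_{\varphi_i} := \theta_{\varphi_i} + 1$ and $\varphi_i$ sets
    $p_{\varphi_i}(t+1) := (\pi, \xi')$ such that $\xi'(\varphi_i) = \xi(\varphi_i) - 1$ and $\xi'(\varphi_{-i}) = \xi(\varphi_{-i}) + 1$.
    Otherwise $\varphi_i$ sets $p_{\varphi_i}(t+1) := (\mathrm{Rand}(pls'_{\varphi_i}), \xi_0)$, where

    $pls'_{\varphi_i} = \arg\max_{\pi' \in ps(t+1)} u_{\varphi_i}(\pi')$.

step 6: $\varphi_i$ sends $p_{\varphi_i}(t+1)$ to $\varphi_{-i}$. Let $t := t + 1$ and $i := -i$. Return to step 4.
step 7: Suppose $p_{\varphi_{-i}}(t) = (\pi^{*}, \xi^{*})$. Then $\varphi^{*}$ sets $j := \mathrm{Rand}(\{-1, 1\})$, and
    announces that $p = (\pi^{*}, \xi')$ is the result of the procedure, where
    $\xi'(\varphi_j) = \xi^{*}(\varphi_j) + \theta_{\varphi_j} - \lfloor 0.5 \cdot \kappa \rfloor$ (see footnote 3),
    $\xi'(\varphi_{-j}) = \xi^{*}(\varphi_{-j}) + \theta_{\varphi_{-j}} - \lceil 0.5 \cdot \kappa \rceil$, and $\kappa = \theta_{\varphi_{-1}} + \theta_{\varphi_{1}}$.

1 We adopt and, for ease of presentation, further strengthen the simple agents
assumption, requiring each plan to be bounded in length by a fixed $\delta$.

2 Given a set, Rand returns an element of the set randomly.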
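
The final adjustment in step 7 can be sketched as follows, under one reading of the step: the arbitrator adds each agent's concession counter back to its side payment and then removes half of the total number of concessions, rounded down for the randomly chosen agent and rounded up for the other. The function name and the dictionary layout (keys -1 and 1 for the two agents) are assumptions made for illustration.

```python
def final_side_payments(xi_star, theta, j):
    """Adjust the side payments of the accepted proposal.

    xi_star: dict {-1: payment, 1: payment} of the accepted proposal (sums to zero)
    theta:   dict {-1: count, 1: count} of unit concessions made by each agent
    j:       the agent index (-1 or 1) drawn at random by the arbitrator
    """
    kappa = theta[-1] + theta[1]       # total number of concessions
    floor_half = kappa // 2            # floor(0.5 * kappa)
    ceil_half = -(-kappa // 2)         # ceil(0.5 * kappa)
    return {
        j: xi_star[j] + theta[j] - floor_half,
        -j: xi_star[-j] + theta[-j] - ceil_half,
    }


xi_star = {-1: 2, 1: -2}
theta = {-1: 1, 1: 4}
first = final_side_payments(xi_star, theta, j=1)
second = final_side_payments(xi_star, theta, j=-1)
```

Because the floor and the ceiling of half the total add up to the total, the adjusted payments still sum to zero whichever agent is drawn.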

If we observe this procedure, we remark that, for all $j \in \{-1, 1\}$, $\varphi_j$ only sends
proposals to $\varphi_{-j}$ and $\varphi^{*}$. So for all $\pi \in \Omega(\mathcal{P})$, $\varphi_{-j}$ and $\varphi^{*}$ cannot know $u_{\varphi_j}(\pi)$
(and of course, also $\varphi_j.G$, $\varphi_j.r$, and $\varphi_j.c$) during the procedure.

We now show the properties of PPB. The first key result states that PPB
always terminates in polynomial time.

Theorem 1. Under the simple agents assumption, PPB is guaranteed to terminate,
and it is polynomial in $\min\{u^{*}_{\varphi_{-1}}, u^{*}_{\varphi_{1}}\}$, where
$u^{*}_{\varphi_i} = \min\{u_{\varphi_i}(\pi) \mid \pi \in \Omega^{\perp}(\mathcal{P}) \text{ and } u_{\varphi_{-i}}(\pi) = \cdots\}$.

The second property states that if there is a solution for the planning problem,
then the proposed procedure will not fail.

Theorem 2. failure is the result of PPB if and only if there is no solution to $\mathcal{P}$.

The following theorem shows that the resulting proposal is a solution to $\mathcal{P}$.

Theorem 3. If PPB returns $p \neq$ failure, then $p$ is a solution to $\mathcal{P}$.

5  Conclusion  and  the  Related  Work 

In this paper, we have proposed a model of multi-agent planning problems based
on a bargaining mechanism. We have considered a class of planning situations in
which each agent has its own goals, reward function and cost function. Agents
bargain over joint plans with possible side payments. We have proposed a planning
procedure which possesses the following properties: (1) the procedure always
terminates in polynomial time; (2) for any given planning problem, if the set of
individually rational plans is non-empty, the procedure can generate a joint plan
at its termination; (3) the side payment associated with the resulting plan leads
to a bargaining solution that is individually rational and Pareto optimal with
minimal distance to the ideal point.

Most of the early work on multiagent planning is built upon fully cooperative
multi-agent systems such as the multi-entity model [7] and MA-STRIPS
planning [4]. Recently, game-theoretic approaches were applied to the problem
of multiagent planning so that common plans or joint plans can be generated
among self-interested agents [1,2]. In particular, Brafman et al. formalized a
multiagent planning problem into a planning game which captures a rich class
of planning scenarios [3]. However, these existing works on multiagent planning
are based on either the assumption that all agents have common goals or the
assumption that all agents must reveal their private information, such as goals,
rewards, costs and/or utilities, to other agents or arbitrators. In contrast, our
approach to multiagent planning is based on a bargaining mechanism, which
assumes that goals, rewards and costs are private information and will not be
revealed to any other agents or arbitrators. In fact, these pieces of information
determine the bargaining power of an agent.

3 $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denote the ceil and floor functions on real numbers, respectively.

As future work, we will extend the present planning model to n-agent systems
(n > 2). The main challenge of the extension is how to offer a side payment to each
other agent without knowing the other agents' demands (obviously, equal
distribution does not work). Secondly, it is interesting to extend the current work
to nondeterministic cases. This requires redefining the solution concept and the
CoAchieve algorithm in a strong [6] or probabilistic style. Finally, more general
mechanisms can be designed for multi-agent planning to deal with changing
goals, incomplete information [9,10], and reasoning agents [11,12].

Abstract. We present a memory-bounded approximate algorithm for
solving infinite-horizon decentralized partially observable Markov decision
processes (DEC-POMDPs). In particular, we improve upon the
bounded policy iteration (BPI) approach, which searches for a locally
optimal stochastic finite state controller, by accompanying it with reachability
analysis on controller nodes. As a result, the algorithm has different
optimization criteria for the reachable and the unreachable nodes, and it is
more effective in the search for an optimal policy. Through experiments
on benchmark problems, we show that our algorithm is competitive with
the recent nonlinear optimization approach, both in solution time
and in policy quality.


1  Introduction 

The decentralized POMDP (DEC-POMDP) is a popular framework for modeling
decision making problems where two or more agents have to cooperate in
order to maximize a common payoff, and to act based on imperfect state
information. While the DEC-POMDP can be applied to many domains such as
network routing and multi-robot coordination, computing an optimal policy
is known to be intractable [1].

In this paper, we are interested in solving infinite-horizon DEC-POMDPs by
searching in the space of fixed-size finite state controllers (FSCs). Specifically, we
represent the individual policy for each agent as a stochastic FSC in which the
nodes correspond to action selection strategies and the transitions correspond
to observation strategies. A number of methods have been proposed for
finding FSC policies, but most relevant to our work are the bounded policy
iteration for DEC-POMDPs (DEC-BPI) [2] and the nonlinear optimization approach
(NLO) [4].

We propose an improved version of DEC-BPI that addresses some of the
limitations that prevent the algorithm from finding a high-quality FSC policy.
Our insight for the improvement is based on the observation that we
need different optimization criteria depending on whether a controller node in
the FSC is reachable or not. We show the effectiveness of the proposed algorithm
via experiments on standard benchmark problems.

2 Background

A decentralized partially observable Markov decision process (DEC-POMDP)
is a multi-agent extension to the POMDP framework. More formally, a
DEC-POMDP is defined as a tuple $\langle I, S, b_0, \{A_i\}, \{Z_i\}, T, O, R \rangle$ where

- $I$ is a finite set of agents

- $S$ is a finite set of states shared by all agents

- $b_0$ is the initial state distribution, where $b_0(s)$ denotes the probability that
the system starts in state $s$

- $A_i$ is a finite set of actions available to agent $i$; the set of joint actions is
denoted as $A = \prod_{i \in I} A_i$

- $Z_i$ is a finite set of observations available to agent $i$; the set of joint
observations is denoted as $Z = \prod_{i \in I} Z_i$

- $T$ is a transition function where $T(s, a, s')$ denotes the probability $P(s' \mid s, a)$
of changing to state $s'$ from state $s$ by executing joint action $a$

- $O$ is an observation function where $O(s', a, z)$ denotes the probability $P(z \mid a, s')$
of making joint observation $z$ when taking joint action $a$ and arriving in
state $s'$

- $R$ is a reward function where $R(s, a)$ denotes the shared reward received by
all agents when taking joint action $a$ in state $s$.

Since the state is not directly observable and the observations are local to each
agent, each agent chooses actions based on its own local history. This mapping
from local observation histories to actions comprises a local policy, and the set
of every agent's local policy comprises a joint policy.

A popular representation for policies in infinite-horizon problems is to use
stochastic finite state controllers (FSCs). The local policy for agent $i$ is
represented as a stochastic FSC $\pi_i = \langle Q_i, \psi_i, \eta_i \rangle$, where

- $Q_i$ is the finite set of controller nodes

- $\psi_i$ is the action selection strategy for each node, where $\psi_i(q, a)$ denotes the
probability $P(a \mid q)$ of choosing action $a$ in node $q$

- $\eta_i$ is the observation strategy for each node, where $\eta_i(q, a, z, q')$ denotes the
probability $P(q' \mid q, a, z)$ of changing to node $q'$ from node $q$ when executing
action $a$ and making observation $z$.

The set of $\pi_i$ for each agent $i$ comprises a joint policy $\pi$, and the set of nodes
from each agent's controller comprises a joint node.
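
To make the controller representation concrete, here is a minimal sketch of how one agent could execute a stochastic FSC. The class and method names, the dictionary layout, and the action and observation labels in the example are hypothetical; the example uses degenerate (probability 1) distributions so that its behavior does not depend on the random draws.

```python
import random


def sample(dist, rng):
    """Draw a key from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


class StochasticFSC:
    """Local policy of one agent: action selection psi and node transition eta."""

    def __init__(self, psi, eta, start):
        self.psi = psi      # psi[q] -> {action: P(a | q)}
        self.eta = eta      # eta[(q, a, z)] -> {next node: P(q' | q, a, z)}
        self.node = start

    def act(self, rng):
        """Choose an action according to psi at the current node."""
        return sample(self.psi[self.node], rng)

    def observe(self, action, observation, rng):
        """Move to the next node according to eta."""
        self.node = sample(self.eta[(self.node, action, observation)], rng)


rng = random.Random(0)
psi = {"q0": {"listen": 1.0}, "q1": {"open": 1.0}}
eta = {("q0", "listen", "growl"): {"q1": 1.0}}
controller = StochasticFSC(psi, eta, start="q0")
first_action = controller.act(rng)
controller.observe(first_action, "growl", rng)
second_action = controller.act(rng)
```

The controller never sees the state: its next node depends only on its current node, its own action, and its own observation.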

2.1  Bounded  Policy  Iteration  for  DEC-POMDPs 

Bernstein et al.'s [2] bounded policy iteration for DEC-POMDPs (DEC-BPI) is
an extension of the bounded policy iteration algorithms for POMDPs [3] to the
multi-agent case. It is a greedy local search algorithm that iteratively improves a
joint stochastic FSC with a fixed number of nodes by alternating between policy
evaluation and improvement. In the policy evaluation step, DEC-BPI computes
the value function of the current joint controller by solving the Bellman equation.
In the policy improvement step, DEC-BPI randomly selects one of the nodes of
an agent, and solves a linear program to obtain an improved controller.

2.2 Nonlinear Optimization Approach

Amato et al.'s [4] nonlinear optimization (NLO) takes a more direct approach
to obtaining an optimal controller. The problem is formulated as a nonlinear
program (NLP) and a state-of-the-art NLP solver is used to find solutions. Since
the problem is nonconvex, most NLP solvers yield only a locally optimal
solution, as is the case with DEC-BPI.

3  Point-Based  DEC-BPI 

Before we present our algorithm, let us take a closer look at the policy
improvement in DEC-BPI. The linear program tries to find better parameters for a node,
assuming that we use the controller with the new parameters for the first time
step, and then the one with the old parameters from the second step on.

We want the intermediate FSCs during the iterations of DEC-BPI to represent
the set of policies that perform well with respect to various reachable state
distributions starting from $b_0$, but DEC-BPI does not necessarily show this
behavior since the monotonic improvement condition requires improving the value
for all state distributions.


Table  1.  Point-based  DEC-BPI 


B ← SampleBeliefs()
repeat
    $V^{\pi}$ ← Evaluate($\pi$)
    C ← ReachableNodeStates(B, $\pi$, $V^{\pi}$)
    ($\psi$, $\eta$) ← ImprovePolicy($\pi$, $V^{\pi}$, C)
until no improvement in any node of any agent


The main idea behind our point-based DEC-BPI is to have different optimization
criteria depending on whether or not a controller node is reachable from
the set of useful nodes. The overall algorithm is shown in Table 1, and in the
remainder of this section, we explain each step of the algorithm.

3.1  Sampling  Beliefs 

Since it is intractable to find the exhaustive set of reachable multiagent beliefs
under the optimal policy, we approximate the set by sampling from a random
policy, similar to [6]. Formally, given a T-step joint tree policy and a joint history
$h_T = (a_1, z_1, \ldots, a_T, z_T)$ of actions and observations from time step 1 to T, the
associated (unnormalized) state distribution $b(h_T, \cdot)$ is recursively computed by

\[
b(h_T, s') = O(s', a_T, z_T) \sum_{s} T(s, a_T, s')\, b(h_{T-1}, s)
\]

where $h_{T-1}$ is the sub-history from time step 1 to T - 1, and $b(h_0, s) = b_0(s)$.
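
The recursion above can be sketched in a few lines. The function name, the dictionary encoding of T and O, and the two-state example (with made-up probabilities) are assumptions for illustration only.

```python
def update_belief(prev, action, observation, states, T, O):
    """One step of the unnormalized belief recursion:
    b(h_T, s') = O(s', a_T, z_T) * sum_s T(s, a_T, s') * b(h_{T-1}, s).

    prev: dict state -> b(h_{T-1}, state)
    T:    dict (s, a, s') -> P(s' | s, a)
    O:    dict (s', a, z) -> P(z | a, s')
    """
    return {
        s_next: O[(s_next, action, observation)]
        * sum(T[(s, action, s_next)] * prev[s] for s in states)
        for s_next in states
    }


states = ["left", "right"]
# Hypothetical model: "listen" leaves the state unchanged and is 85% accurate.
T = {(s, "listen", s2): 1.0 if s == s2 else 0.0 for s in states for s2 in states}
O = {("left", "listen", "hear-left"): 0.85, ("right", "listen", "hear-left"): 0.15}
b0 = {"left": 0.5, "right": 0.5}
b1 = update_belief(b0, "listen", "hear-left", states, T, O)
```

The result is left unnormalized, as in the text: its entries sum to the probability of the history rather than to one.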

Table 2. Procedure ReachableNodeStates


C ← {}
for each belief b ∈ B do
    $f_b \leftarrow \arg\max_{f : H \to Q} \sum_{h, s} b(h, s)\, V^{\pi}(f(h), s)$
    for each joint history h ∈ H and state s ∈ S do
        if b(h, s) > 0 then
            C ← C ∪ {($f_b$(h), s)}
        end if
    end for
end for
repeat
    for all (q', s') s.t. (q, s) ∈ C and $T^{\pi}$(q, s, q', s') > 0 do
        C ← C ∪ {(q', s')}
    end for
until no more node-state pair to add
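
The second loop of Table 2 is a closure computation over node-state pairs. A minimal sketch is given below; it assumes the seed pairs (those selected for the sampled beliefs) and the positive-probability successors of each pair have already been computed, and the names `reachable_node_states`, `seeds`, and `successors` are hypothetical.

```python
def reachable_node_states(seeds, successors):
    """Close a set of (joint node, state) pairs under one-step transitions.

    seeds:      iterable of (node, state) pairs
    successors: dict mapping a pair to the pairs it reaches with
                positive probability under the current joint controller
    """
    reachable = set(seeds)
    frontier = list(reachable)
    while frontier:
        pair = frontier.pop()
        for nxt in successors.get(pair, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                frontier.append(nxt)
    return reachable


successors = {
    ("q0", "s0"): [("q1", "s0"), ("q0", "s1")],
    ("q1", "s0"): [("q0", "s0")],
    ("q2", "s1"): [("q2", "s1")],   # never reached from the seed below
}
C = reachable_node_states({("q0", "s0")}, successors)
```

Using an explicit frontier visits each pair once, which has the same effect as repeating the loop of Table 2 until no pair can be added.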


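Table 3. The linear program in point-based DEC-BPI for improving reachable
node $q_i$. The variable $x(a_i)$ represents $\psi_i(q_i, a_i)$, and $x(a_i, z_i, q_i')$ represents
$\eta_i(q_i, a_i, z_i, q_i')$. $P(a_{-i} \mid q_{-i})$ denotes $\prod_{k \neq i} \psi_k(q_k, a_k)$, and
$P(q_{-i}' \mid q_{-i}, a_{-i}, z_{-i})$ denotes $\prod_{k \neq i} \eta_k(q_k, a_k, z_k, q_k')$.


Variables: $\epsilon$, $x(a_i)$, $x(a_i, z_i, q_i')$

Objective: Maximize $\epsilon$

Improvement constraints: $\forall (q, s) \in C$ with $q = (q_i, q_{-i})$,

\[
V(q, s) + \epsilon \le \sum_{a} P(a_{-i} \mid q_{-i}) \Big[ x(a_i) R(s, a) + \gamma \sum_{z, q', s'} x(a_i, z_i, q_i')\, P(q_{-i}' \mid q_{-i}, a_{-i}, z_{-i})\, T(s, a, s')\, O(s', a, z)\, V(q', s') \Big]
\]

Unreachability maintenance constraints: $\forall (q_i, q_{-i}, s) \in C$ and $\forall (q', s') \notin C$,

\[
\sum_{a, z} P(a_{-i} \mid q_{-i})\, x(a_i, z_i, q_i')\, P(q_{-i}' \mid q_{-i}, a_{-i}, z_{-i})\, T(s, a, s')\, O(s', a, z) = 0
\]

Probability constraints:

\[
\sum_{a_i} x(a_i) = 1, \qquad \forall a_i, z_i: \ \sum_{q_i'} x(a_i, z_i, q_i') = x(a_i)
\]

\[
x(a_i) \ge 0 \ \ \forall a_i, \qquad x(a_i, z_i, q_i') \ge 0 \ \ \forall a_i, z_i, q_i'
\]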

3.2  Reachability  of  Nodes  and  States 

Once we evaluate the value of the current policy, we identify the set of useful joint
nodes. Formally, a joint node $q$ of joint controller $\pi$ is useful for belief $b \in B$ if it
maximizes the value at the belief. Intuitively, the useful nodes are the candidate
initial nodes if the system starts at the state distributions dictated by the belief $b$.
Once we have identified the set of useful joint nodes, we examine the reachability
of all the joint nodes from the useful joint nodes. Table 2 shows the overall
pseudo-code for finding the set of reachable joint nodes given the set $B$ of beliefs.

Table 4. The linear program in point-based DEC-BPI for improving unreachable node
$q_i$ with respect to local history $h_i$ in belief $b$. $P(s', z \mid s, a)$ is a shorthand notation for
$T(s, a, s')\, O(s', a, z)$.


Variables: $x(a_i)$, $x(a_i, z_i, q_i')$

Objective: Maximize

\[
\sum_{h_{-i}, s} b(h_i, h_{-i}, s) \sum_{a} P(a_{-i} \mid f_b(h_{-i})) \Big[ x(a_i) R(s, a) + \gamma \sum_{z, q', s'} x(a_i, z_i, q_i')\, P(q_{-i}' \mid f_b(h_{-i}), a_{-i}, z_{-i})\, P(s', z \mid s, a)\, V(q', s') \Big]
\]

Unreachability maintenance constraints: $\forall h_{-i}, s$ with $b(h_i, h_{-i}, s) > 0$ and $\forall (q', s') \notin C$,

\[
\sum_{a, z} P(a_{-i} \mid f_b(h_{-i}))\, x(a_i, z_i, q_i')\, P(q_{-i}' \mid f_b(h_{-i}), a_{-i}, z_{-i})\, T(s, a, s')\, O(s', a, z) = 0
\]

Probability constraints as in Table 3


3.3  Modified  Policy  Improvement 

As in DEC-BPI, our algorithm randomly selects one of the nodes of an individual
controller and uses an LP solver to find new parameter values that improve the
controller. The joint node $q$ is defined to be reachable if there exists a state $s$ such
that $(q, s) \in C$, and the node $q_i$ is defined to be reachable if there exists $q_{-i}$ such
that $q = (q_i, q_{-i})$ is reachable. If the node $q_i$ selected for improvement is reachable,
we solve the LP shown in Table 3. Note that the monotonic improvement
is only concerned with reachable joint nodes and states. On the other hand, if
the selected node $q_i$ is unreachable, we solve the LP shown in Table 4. The LP
essentially tries to make the selected node useful for some belief.

4  Experiments 

We implemented all three algorithms discussed in this paper: DEC-BPI, NLO, and
point-based DEC-BPI. We used two DEC-POMDP problems for the experiments:
decentralized tiger [5] and box-pushing [7]. Our implementation of DEC-BPI was
actually the biased version of DEC-BPI, which takes into account the reachability
of joint nodes and states by computing the occupancy distribution. We also
accordingly modified the LPs for point-based DEC-BPI using the occupancy
distribution in order to favor biased policy improvement. The beliefs were collected
from randomly instantiated 1-step and 2-step tree policies. The number of beliefs
for each problem is: 11 for decentralized tiger and 20 for box-pushing.

We ran each algorithm on each problem, starting from randomly instantiated
stochastic controllers with a varying number of nodes. We executed 20 runs for
each controller size, measuring the value and the wall clock time of each run.

Figures 1 and 2 show the value and time results on the problems. Point-based
DEC-BPI was able to yield controllers that attain values much higher than those
from DEC-BPI, while taking a fraction of the time compared to NLO. Note that in
the case of box-pushing, we believe that we need more nodes than reported here
in order to have performance comparable to NLO, since there are many more
reachable beliefs than in other problems.

Fig. 1. Value and time results on the decentralized tiger problem

Fig. 2. Value and time results on the box-pushing problem

Abstract. Chinese named entity recognition is a challenging, difficult,
yet important task in natural language processing. This paper presents
a novel approach based on a hierarchical hybrid model to recognize Chinese
named entities. Three mutually dependent stages, namely boosting, Markov
Logic Networks (MLNs) based recognition, and abbreviation detection,
are integrated in the model. The AdaBoost algorithm is first utilized for fast
recognition of simple named entities. More complex named entities
are then piped into MLNs for accurate recognition. In particular, the left
boundary recognition of named entities is considered. Lastly, special care
is taken for classifying the abbreviated named entities by using the global
context information in the same document. Experiments were conducted
on the People's Daily corpus. The results show that our approach can improve
the performance significantly, with precision of 94.38%, recall of
93.89%, and $F_{\beta=1}$ value of 93.97%.


1  Introduction 

Named entity recognition (NER) is widely acknowledged as one of the central
tasks in natural language processing (NLP). The essential goal of NER is to
identify and classify certain proper nouns, such as person names (PER),
organizations (ORG), locations (LOC), and so on. NER has attracted much attention
in the research community for a long time. Sun et al. proposed a class-based
language model for Chinese NER, using different models to identify different types
of named entities (NEs) in Chinese text [1]. Yu et al. successfully used a
high-performance boosting algorithm to handle the Chinese NER task [2]. However,
the results remain unsatisfactory.

To this end, a novel approach based on a hierarchical hybrid model is proposed
to recognize Chinese NEs. Three mutually dependent stages, namely, boosting,
Markov Logic Networks (MLNs) based simple recognition, and abbreviated NEs
detection are integrated. The experimental results indicate that the hierarchical
hybrid approach can improve the performance significantly. In Section 2, the
hierarchical hybrid model is presented. Experiments and results are given in
Section 3. Section 4 gives concluding remarks.

2 The Hierarchical Hybrid Model

2.1 Simple Named Entities Recognition

Boosting is chosen to recognize the simple NEs for two reasons. Firstly, a
logical semantic and syntactic unit in natural language is the word. The character
is a basic written unit in the Chinese language and has no real meaning.
Consequently, word segmentation is the fundamental task which transforms a Chinese
character string into a word sequence. Another reason is to achieve high accuracy
on simple Chinese NEs. The boosting technique in machine learning
can meet both requirements in Chinese NER. Since Yu et al. have successfully
applied the high-performance boosting technique called AdaBoost.MH to the
Chinese NER task [2,3], we use this technique directly.

2.2  Complicated  and  Compound  Named  Entities  Recognition 

While the boosting algorithm can identify many simple NEs, some organizations
and locations are difficult to identify due to lack of linguistic knowledge. Take the
organization “重庆市政府/Chongqing Municipality Government” as an example:
“重庆/Chongqing” is the name part of the location, “市/Municipality”
is the salient word of the location, and “政府/Government” is the salient word of
the organization; the three parts combine into a single complicated
organization name. Based on the above observations, we incorporate human knowledge
via MLNs to validate the boosting NER hypotheses [3]. Since MLNs can easily
transform some linguistic knowledge into first-order logic formulas, MLNs were
chosen to recognize the complicated and compound NEs [4,5,6].

Now we present an MLN for our task. The main evidence predicate in the
MLN is TaggedEntity(te, i, c), which is true iff tagged entity te appears in the
ith position of the cth sentence. Punctuation marks are not treated as separate
tagged entities; rather, the predicate HasPunc(c, i) is true iff a punctuation
mark appears immediately after the ith position of the cth sentence. The
predicate SalientWord(te, sw, i, c) is true iff tagged entity te in the ith position of
the cth sentence ends with salient word sw, where sw ∈ {LocSalientWord,
OrgSalientWord}. The query predicate is InField(f, i, c). InField(f, i, c) is
true iff the ith position of the cth sentence is part of field f, where f ∈ {Location,
Organization}, and inferring it performs recognition.

Now we describe our recognition model via MLN. Generally, different types of
NEs have different structures [7]. Typical structures of person, location and
organization names are as follows:


Person → {[last name][first name]}(title word)

Location → (name part)*(salient word)

Organization → {[person name][place name][kernel name][org. name]}*[org. name](salient word)

Here ()* means repeating one or several times, and {}* means selecting at least one
of the items. Meanwhile, since the left boundary of Chinese NEs is more difficult to
recognize than the right boundary due to the lack of a salient word, both identifiers
()* and {}* need to be used to segment the left boundary of Chinese NEs effectively.
Therefore, all the above linguistic knowledge can be represented by
rules of the following forms:

TaggedEntity(+te, i, c) ∧ SalientWord(+te, +sw, i, c) ⇒ InField(+f, i, c)

∃te' TaggedEntity(+te', i, c) ∧ TaggedEntity(+te, i+1, c) ∧ SalientWord(+te, +sw, i+1, c) ∧ ¬HasPunc(c, i) ⇒ InField(+f, i, c)

Furthermore, we impose the following hard constraints on recognized NEs. Firstly,
all kinds of tagged entities are within 25 Chinese characters. Secondly, since
all NEs are proper nouns, the tagged entities should end with noun words.
We then define three evidence predicates, which are NamedEntity(ne, i, c),
LengthEntity(low_25, ne, c) and EndWith(nw, ne, c), respectively.
NamedEntity(ne, i, c) is true iff NE ne appears in the ith position of the cth sentence.
LengthEntity(low_25, ne, c) is true iff NE ne appearing in the cth sentence is
no longer than 25 Chinese characters. EndWith(nw, ne, c) is
true iff NE ne appearing in the cth sentence ends with a noun word. Both
constraints are represented by a simple rule:

NamedEntity(ne, i, c) ⇒ LengthEntity(low_25, ne, c) ∧ EndWith(nw, ne, c)

After constructing these rules, MLNs can learn and perform inference to
recognize complicated and compound NEs.

2.3  Named  Entity  Abbreviation  Recognition 

After recognizing the complicated or compound Chinese NEs, there may still
exist some abbreviated NEs which are difficult to identify. The aim in this stage
is to further improve the accuracy by recognizing the abbreviated NEs. In a
document, some NEs usually appear in abbreviated form in the later text
after first appearing in full form. This increases the difficulty of recognition.
Fortunately, the global feature from the same document may play a key
role in recognizing abbreviated NEs. A method to detect abbreviated Chinese
NEs can be described as follows. First, a Static Chinese Named
Entity List (SCNEL) is constructed and the recognized NEs are stored in it. Second,
a Dynamic Candidate Word List (DCWL) is constructed and all candidate words
are stored in it. Third, every candidate word in the DCWL has a
corresponding feature, which is initially set to 0. When a candidate word is
an arbitrary combination of one or more characters of a Chinese NE in the
SCNEL, the feature is set to 1.
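
The feature assignment described above can be sketched as follows. The text does not say exactly which character combinations count, so this sketch makes an assumption: a candidate word matches if its characters occur, in order, within a single NE of the SCNEL. The function names and the example entries are hypothetical.

```python
def is_in_order_combination(candidate, full_name):
    """True if the characters of `candidate` occur in `full_name` in the same order."""
    remaining = iter(full_name)
    # `ch in remaining` advances the iterator, so order is respected.
    return bool(candidate) and all(ch in remaining for ch in candidate)


def abbreviation_features(scnel, dcwl):
    """Map each candidate word to 1 if it may abbreviate some recognized NE, else 0."""
    return {
        word: int(any(is_in_order_combination(word, ne) for ne in scnel))
        for word in dcwl
    }


scnel = ["北京大学"]          # full-form NE recognized earlier in the document
dcwl = ["北大", "清华"]        # candidate words seen later in the same document
features = abbreviation_features(scnel, dcwl)
```

A looser reading (any subset of characters, in any order) would only change the matching predicate; the two-list structure stays the same.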

3  Experiments  and  Results 

Experiments  were  conducted  using  People's  Daily  Corpus  of  January  1998 
(http://icl.pku.edu.cn/icl_groups/corpus/dwldforml).  It  is  tagged  with 
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(a)  PER  type  (b)  LOC  type  (c)  ORG  type 

Fig.  1.  Precision  and  recall  curves  for  different  NE  types 


Table  1.  Chinese  NER  performance  based  on  the  hierarchical  hybrid  model  in  three 
mutually  dependent  stages 


Approach                          Type    Precision               Recall                  F_{β=1}
                                          All    Left   Right     All    Left   Right     All    Left   Right

Boosting                          PER     90.46  87.48  89.67     89.32  87.40  90.49     90.39  87.28  89.51
                                  LOC     81.25  80.29  81.55     80.94  79.84  80.99     79.15  79.20  80.35
                                  ORG     79.84  76.86  79.94     66.24  65.34  66.45     72.36  71.34  72.42
                                  Total   79.84  81.30  83.41     81.75  78.31  80.89     82.87  80.84  82.88

Boosting + MLNs                   PER     97.46  96.48  93.67     97.32  97.40  97.49     96.94  96.40  96.81
                                  LOC     92.86  91.67  92.01     94.37  93.41  93.47     93.75  91.73  92.20
                                  ORG     90.06  89.78  90.15     89.97  89.79  90.11     90.01  89.73  90.08
                                  Total   93.15  93.15  93.20     93.24  93.21  93.30     93.06  93.05  93.11

Boosting + MLNs + Global Feature  PER     97.52  97.12  97.13     97.37  97.05  97.09     97.46  97.01  97.03
                                  LOC     93.21  92.52  92.53     94.53  93.78  94.07     94.16  94.09  94.12
                                  ORG     90.47  90.41  90.43     90.79  90.70  90.73     90.67  90.66  90.69
                                  Total   94.38  94.32  94.39     93.89  93.76  93.79     93.97  93.49  93.56

Note:  “All”  represents  both  of  boundaries,  “Left”  left  one  and  “Right”  right  one. 


POS according to the Chinese Text POS Tag Set. The corpus contains 26 million
characters. The first half of the corpus is used as the training corpus, and the
testing corpus corresponds to the latter half of January 1998. The evaluation
measures are precision, recall and F_{β=1}.

We first used the decision stump as the weak classifier in boosting to recognize
the simple NEs, running AdaBoost.MH for 5000 rounds. Then we used the learning
and inference algorithms provided in the open-source Alchemy package to
recognize the complicated and compound NEs [5]. We performed discriminative
weight learning using the voted perceptron algorithm, and inference using the
MC-SAT algorithm. In MLN weight learning, we used 100 iterations of gradient
descent and kept the default values for all other parameters. The total
learning time reached 24 hours. In inference, we ran MC-SAT for 20 hours.
Finally, we constructed a SCNEL and a DCWL to help recognize the abbreviated
NEs and implemented the matching-based recognition code in Java. The
experimental platform is a server with two CPUs at 2.8 GHz and 4 GB of memory.
The experimental results are shown in Figure 1. From these figures, we can
conclude that the results using the combination of boosting and MLNs are
clearly more accurate than those of the boosting method alone, and that MLNs
significantly improve accuracy. Furthermore, although our hierarchical hybrid
method only marginally surpasses the combination of boosting and MLNs for PER,
it performs well for LOC and ORG.

As shown in Table 1, the hierarchical hybrid model achieves satisfactory
improvements, with a precision of 94.38%, a recall of 93.89%, and an F_{β=1}
value of 93.97%. Besides, the difference between the left boundary and the
right one falls from 2% in the first stage to 0.5% in the third stage. This
indicates that, through linguistic knowledge, MLNs perform well in left
boundary recognition.

4  Conclusion 

A novel approach based on a hierarchical hybrid model was proposed to recognize
Chinese NEs. This model incorporates three mutually dependent stages into a
unifying framework. Experiments were conducted on the People's Daily corpus.
The results show that our approach can significantly improve performance and
achieves a fairly satisfactory result. Future work is to extend this approach
to larger datasets.
Abstract. Word segmentation is an essential step in building natural language
applications such as machine translation, text summarization, and cross-lingual
information retrieval. For certain Oriental languages where word boundaries are
not clearly defined, the recognition process can become very challenging. One of
the serious problems is dealing with word ambiguity. In this paper, we
investigate the use of Linear Support Vector Machines (LSVM) for word boundary
disambiguation. We empirically show, in the Vietnamese case, that LSVM obtains
a better result than the Trigram Language Model approach.

Keywords: Word Segmentation, Ambiguity Resolution, Covering Ambiguity
Resolution, Trigram Language Model, Linear Support Vector Machines.


1  Introduction 

In Oriental languages, there are no explicit word separators, such as the space
in English, to indicate word boundaries. Word segmentation is the process of
dividing written text into meaningful units, such as words. There are two common
sub-problems in word segmentation: (1) out-of-vocabulary (OOV) word
identification and (2) ambiguity resolution.

In a sequence of Vietnamese syllables, S, composed of two syllables A and B
occurring next to one another, if S, A, and B are each words, then there is a
covering ambiguity in S. For example, the two-syllable string “Nhật ký” can be
interpreted as three different words in the two sentences below.

Sentence 1: Nhật (Japan) | ký (signs) | hiệp định (agreement) | về (about) | giảm
(reduce) | khí thải (gas emission) | nhà kính (greenhouse).

Sentence 2: Nhật ký (diary) | đời (life) | sinh viên (student).

From  the  given  example,  we  observe  that  making  a  right  word  choice  is  not  a  trivial 
task  for  a  computer.  This  determination  has  to  be  obtained  from  knowing  the  context 
of  a  sentence  where  a  word  is  used. 
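The definition of a covering ambiguity can be checked mechanically against a word list. The following is a minimal sketch; the function name and the toy lexicon are hypothetical, and a real system would use a full Vietnamese dictionary.

```python
def covering_ambiguities(syllables, lexicon):
    """Return adjacent syllable pairs (A, B) where A, B and "A B" are all words."""
    hits = []
    for a, b in zip(syllables, syllables[1:]):
        if a in lexicon and b in lexicon and f"{a} {b}" in lexicon:
            hits.append((a, b))
    return hits


# Hypothetical toy lexicon covering the diary example.
lexicon = {"nhật", "ký", "nhật ký", "đời", "sinh viên"}
print(covering_ambiguities("nhật ký đời sinh viên".split(), lexicon))
```

The check only detects a potential ambiguity; choosing between the combined and the split reading still requires the sentence context.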

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 625-630, 2010.
2  Related Works

For the Vietnamese language, research on word ambiguity resolution is still at
an early stage. The work of Le et al. [3] addressed overlapping ambiguity but
not the covering ambiguity problem. It reports impressive precision and recall
rates of 95% and 96.3%, respectively. Nguyen [2,4], in WebSBA, employed Web data
for word segmentation. This work addressed ambiguity resolution using a bigram
language model and word collocation concepts. Its result was compared against
another popular Vietnamese word segmentation approach, JVnSegmenter [5]. There
was no discussion of how ambiguities are handled in [5]. WebSBA had precision
and recall rates of 89% and 82%.

Chinese word segmentation has a similar covering ambiguity problem. Xiao et
al. [6] treated the covering ambiguity problem as word sense disambiguation.
They used a vector space model to formulate the contexts of ambiguous words. For
90 frequent words, the authors manually trained 77,654 sentences and obtained
an accuracy of 96.58%. Recently, Su-qin Feng [1] collected contextual
information statistics of covering ambiguous words and built a context
calculation model using the log likelihood ratio. Fourteen frequently occurring
covering ambiguous words were used for evaluation. The highest accuracy reaches
95.60%.

3  Proposed  Approaches 

A segmenter produces a segmentation result. Because the Longest Matching
strategy [2,4] favors compound words, covering ambiguity errors might still
exist within the segmented text. We examine the segmented disyllabic words in a
sentence, in left-to-right order, for potentially ambiguous words. If a word
matches a rule described in section 3.1, we form a text chunk consisting of
this word and the context words in its neighborhood (about ± 4 words). This
text chunk is evaluated using a disambiguating module detailed in section 3.2
or 3.3 below.

3.1  Patterns  for  Ambiguity  Detection 

The main difference between our work and previous works is that we detect
unknown ambiguous words and resolve them. We defined the following patterns,
denoted in BNF, which contribute to the ambiguity checking process:

•  <syllable-preposition> ::= <syllable> <cho (for) | ở (at) | trên (above) | với
(with, to) | trong (inside) | ... >. For example, the word “trfing trong”.

•  <noun_pronoun-verbs> ::= <noun | pronoun> <là (to be) | làm (to do) | đi (to
go) | cần (to need) | ... >. For example, the word “người làm”.

•  <syllable-pronoun> ::= <syllable> <Selected pronouns: first, second person, or
kinship terms: tôi (I) | anh (you) | ... >. For example, the word “đàn anh”.

•  <Irregularity_in_word_shape> ::= <first_letter_lower_case(syllable)>
<first_letter_upper_case(syllable)>. For example, the word “ba Ba”.
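The four patterns can be approximated with a small matcher over disyllabic words. This sketch is not the authors' implementation: the word sets contain only the examples listed above (with their diacritics as assumed here), the order in which patterns are tested is arbitrary, and the `nouns_pronouns` argument stands in for a part-of-speech resource that the text does not describe.

```python
PREPOSITIONS = {"cho", "ở", "trên", "với", "trong"}   # partial list from the pattern
VERBS = {"là", "làm", "đi", "cần"}                     # partial list from the pattern
PRONOUNS = {"tôi", "anh"}                              # partial list from the pattern


def matched_pattern(word, nouns_pronouns=frozenset()):
    """Return the name of the first pattern a disyllabic word matches, else None."""
    parts = word.split()
    if len(parts) != 2:
        return None
    first, second = parts
    if first[:1].islower() and second[:1].isupper():
        return "irregularity_in_word_shape"
    if second.lower() in PREPOSITIONS:
        return "syllable-preposition"
    if first.lower() in nouns_pronouns and second.lower() in VERBS:
        return "noun_pronoun-verbs"
    if second.lower() in PRONOUNS:
        return "syllable-pronoun"
    return None


print(matched_pattern("ba Ba"))
print(matched_pattern("người làm", nouns_pronouns={"người"}))
```

A word for which `matched_pattern` returns a name would then be handed to the disambiguation module together with its context.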

3.2  Trigram  Language  Model 

A  language  model  is  usually  formulated  as  a  probability  distribution  p(s)  over  a  string 
s  that  attempts  to  reflect  how  frequently  a  string  s  occurs  as  a  sentence  in  a  corpus  [7]. 
We estimate the trigram probabilities from a large text corpus using trigram
frequencies, as formulated in equation 1:

$$p(w_3 \mid w_1, w_2) = \frac{C(w_1 w_2 w_3)}{C(w_1 w_2)} \qquad (1)$$

where $C$ is the count of the sequences $w_1 w_2 w_3$ and $w_1 w_2$ appearing in
a corpus. The trigram probabilities can then be estimated by linear
interpolation as follows:

$$p(w_3 \mid w_1, w_2) = \alpha_1\, p_3(w_3 \mid w_1, w_2) + \alpha_2\, p_2(w_3 \mid w_2) + \alpha_3\, p_1(w_3) \qquad (2)$$

where each $\alpha_i$ is a tuning parameter, or weight, with $\alpha_i \in [0, 1]$.
We find $0 < \alpha_1, \alpha_2, \alpha_3 < 1$ by optimizing on “held-out” data.
A string with a higher probability score is expected to contain correctly
segmented word(s).
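Equations (1) and (2) translate directly into counting code. The sketch below uses maximum-likelihood counts and fixed interpolation weights; the weights `(0.6, 0.3, 0.1)` are placeholder values that sum to 1 (an assumption), whereas the text tunes them on held-out data.

```python
from collections import Counter


def train_counts(sentences):
    """Collect unigram, bigram and trigram counts from tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    return uni, bi, tri


def interpolated_prob(w1, w2, w3, counts, alphas=(0.6, 0.3, 0.1)):
    """Linearly interpolated estimate of p(w3 | w1, w2), as in equation (2)."""
    uni, bi, tri = counts
    total = sum(uni.values())
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0  # equation (1)
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / total if total else 0.0
    a1, a2, a3 = alphas
    return a1 * p3 + a2 * p2 + a3 * p1


counts = train_counts([["a", "b", "c"], ["a", "b", "d"]])
print(interpolated_prob("a", "b", "c", counts))
```

To compare two candidate segmentations, one would multiply (or sum the logarithms of) these probabilities over each word sequence and keep the higher-scoring one.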


3.3  Linear  Support  Vector  Machines  (LSVMs),  Text  Representation  and 
Feature  Selection 

Many problems in natural language processing can be categorized as
classification problems. Covering ambiguity resolution can be regarded in the
same fashion: an ambiguous word should either be divided into two individual
words, a separated condition (-1), or kept as a single word, a combined
condition (+1). We use LSVM [10] to make this decision.

Each chunk of text (a text chunk) consists of an ambiguous word and other
context words nearby (± 4 words), and is represented as a vector of words. For
text classification, simpler binary feature values (i.e., a word either occurs,
a 1 value, or does not occur, a 0 value) are often used [10]. We also eliminate
noise words such as “và” (and), “của” (of), “khi” (when), etc. These words are
not essential to the overall context of a text chunk.
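A text chunk can be turned into a binary feature line as follows. This is a sketch under assumptions: the vocabulary mapping and its feature ids are hypothetical, the stop list holds only the three noise words named above, and the output follows the sparse `label id:value ... # comment` layout commonly used for SVMlight input, with ids in increasing order.

```python
STOP_WORDS = {"và", "của", "khi"}  # noise words named in the text


def to_svmlight_line(label, words, vocab):
    """Encode a segmented text chunk as one sparse binary feature line."""
    # Each known, non-noise word contributes the feature "id:1" exactly once.
    ids = sorted({vocab[w] for w in words if w not in STOP_WORDS and w in vocab})
    features = " ".join(f"{i}:1" for i in ids)
    return f"{label} {features} # {' | '.join(words)}"


# Hypothetical vocabulary ids.
vocab = {"nhật ký": 44403, "đời": 67833, "sinh viên": 75251}
print(to_svmlight_line(0, ["nhật ký", "đời", "sinh viên"], vocab))
```

A label of 0 marks an example whose class is still unknown, as in a query; training lines would carry +1 (combined) or -1 (separated).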


3.4  Learning  Support  Vector  Machines 

We are motivated by a method called active learning [9], which suggests which
texts from the pool to use in the learning process. This reduces the human
effort needed to label training examples. Our pool-based active learning
algorithm is described as follows:

Algorithm  ActiveLearner 
Inputs : 

Pool  of  unlabeled  examples;  Initial  Classifier; 

Output : 

Updated  classifier; 
begin 

(1)  Classify  all  unlabeled  examples; 

(2)  Separate  examples  into  2  partitions  where 
each  example's  decision  score  has  a  closest 
distance  to  its  partition's  centroid  value; 

(3)  Trainer  trains  the  classifier  with  new 
labeled  examples  in  a  partition  which  has  a 
higher  centroid  value; 

end. 
In the aforementioned algorithm, the pool of unlabeled examples consists of text
chunks collected from the Web via a Yahoo! web service. Collected examples are
classified to obtain their decision values using SVMlight [8]. We then separate
these examples into two partitions, where each contains the examples whose
decision values are closest to its centroid. The centroid value of a partition
is obtained by averaging the decision values of its examples. Finally, a trainer
trains only on the examples belonging to the upper partition, which has the
higher centroid value. Figure 1 shows two scatter plots. Plot 1 contains all
unlabeled examples in a pool intended for training. Plot 2 contains the much
reduced pool of unlabeled examples suggested by the system for a trainer to
train.


Fig. 1. Scatter Plot 1 (left) contains all examples requested for training. Scatter Plot 2 (right)
contains actual selected examples to be trained by a trainer.

Heuristically,  we  choose  to  select  examples  located  in  an  upper  partition,  examples 
shown  in  Plot  2,  since  these  examples  are  expected  to  change  the  maximum  margin 
hyperplane  the  most  [9]. 
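One way to read the partitioning step is as a two-cluster split of the one-dimensional decision values, followed by keeping the cluster with the higher centroid. The sketch below implements that reading; the function name, the initialization at the minimum and maximum score, and the iteration cap are assumptions rather than details given in the text.

```python
def select_for_labeling(scores, max_iter=100):
    """Return indices of the examples in the partition with the higher centroid."""
    if not scores:
        return []
    low_c, high_c = min(scores), max(scores)
    upper = []
    for _ in range(max_iter):
        # Assign each example to the nearer centroid (ties go to the lower one).
        upper = [i for i, s in enumerate(scores) if abs(s - high_c) < abs(s - low_c)]
        if not upper:
            return []  # all decision values are identical
        upper_set = set(upper)
        lower = [i for i in range(len(scores)) if i not in upper_set]
        new_low = sum(scores[i] for i in lower) / len(lower)
        new_high = sum(scores[i] for i in upper) / len(upper)
        if (new_low, new_high) == (low_c, high_c):
            break  # centroids are stable
        low_c, high_c = new_low, new_high
    return upper


decision_values = [-1.2, -0.9, -1.0, 0.8, 1.1, 0.9]
print(select_for_labeling(decision_values))
```

Only the returned examples would be passed to the human trainer, which is where the reduction in labeling effort comes from.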

3.5  Classification  by  Support  Vector  Machines 

Once a covering ambiguous word is detected by a word segmentation algorithm, a
text chunk is formulated. This text chunk contains the ambiguous word itself and
its neighboring words within about ± 4 words. The two possible interpretations
of this word, the combined form and the split form, are the possible outcomes of
the SVMlight classifier. Here is an example of queried vectors:

0 57022:1 67833:1 70589:1 75251:1 # nhật | ký | đời | sinh viên
0 44403:1 67833:1 75251:1 # nhật ký | đời | sinh viên

SVMlight returns two real values of the decision function, one for each
examined vector:


0.99076741

0.99981328


We implement the following decision table to finalize a decision given the
decision values from SVMlight.
Decision Values from SVMlight

Vector, combined word form   Vector, separated words form   Final Decision

Positive value, f(x)         Positive value, f(y)           Combine word.
Negative value, f(x)         Negative value, f(y)           Separate words.
Positive value, f(x)         Negative value, f(y)           If f(x) > |f(y)|: combine word;
                                                            separate words otherwise.
Negative value, f(x)         Positive value, f(y)           Take the combined form
                                                            (default behavior).
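The decision table maps onto a short function. In this sketch the function name and the string return values are hypothetical, and a decision value of exactly zero is treated as negative, which the table does not specify.

```python
def decide(f_combined: float, f_separated: float) -> str:
    """Apply the decision table to the two SVM decision values.

    f_combined  -- f(x), decision value of the vector in combined word form
    f_separated -- f(y), decision value of the vector in separated words form
    """
    if f_combined > 0 and f_separated > 0:
        return "combine"
    if f_combined <= 0 and f_separated <= 0:
        return "separate"
    if f_combined > 0:  # here f_separated <= 0
        return "combine" if f_combined > abs(f_separated) else "separate"
    return "combine"    # f(x) negative, f(y) positive: default behavior


# Decision values from the diary example: both positive.
print(decide(0.99981328, 0.99076741))
```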

4  Experiments 

We used WebSBA [2] as our test segmenter. We evaluated three methods: (1)
Longest Match Rule; (2) Trigram language model; and (3) LSVM. We started with a
raw text corpus of 166,484 text titles, which serves as our corpus for
estimating trigram frequencies and training the trigram model. From the same
list, we randomly took 10,300 document text titles, for ease of data
extraction, and performed word segmentation with method (1). We also located
ambiguous words using the word patterns (section 3.1). We identified and
learned 1,535 texts containing about 120 potential covering ambiguous words.
This low number could reflect the observation in [3]. Using the 120 ambiguous
words, we queried the Web to obtain another 1,153 texts. With active learning,
only 675 texts (about 58%) were selected for training. From the same pool, we
also randomly selected another set of texts for non-active learning. We learned
these examples in addition to the 1,535 examples trained earlier. In our final
task, we fetched another 3,174 unknown texts from the Web. These texts serve as
our unseen test set. The results are as follows:


Table 1. Evaluations with test set.


Evaluation Category                            Accuracy

Rule based approach (Longest Matching Rule)    80.4%
Trigram Language Model                         77.2%
LSVM with learning of random examples          88.8%
LSVM with active learning                      92.5%

The above result indicates that the SVM with active learning approach
outperforms all the other approaches on the unseen test set.

5  Conclusion 

Two possible approaches to disambiguate covering word ambiguity have been
described for languages where word boundaries are not clearly defined. Our test
result confirms that the learning-based approach has advantages in terms of
flexibility, accuracy and scalability. Among the Vietnamese word segmentation
works we studied [2,3,4,5], we believe that this is the first attempt to
address the covering ambiguity condition specifically. In future work, we plan
to extend the scope of the experiment with a larger volume of held-out data for
testing. We are also looking into the possibility of extending this concept to
address the overlapping ambiguity condition.
Abstract. In this paper, we present some approaches to diacritics
restoration in Vietnamese, based on letters and syllables. Experiments
with language-specific feature selection are conducted to evaluate the
contribution of different types of features. Experimental results reveal
that the combination of AdaBoost and C4.5, using a letter-based feature
set, achieves 94.7% accuracy, which is competitive with other systems for
diacritics restoration in Vietnamese. Test data for the diacritics
restoration task in Vietnamese can be freely collected with simple
preprocessing, whereas large test data sets for many natural language
processing tasks in Vietnamese are lacking. Diacritics restoration could
therefore be used as an application-driven evaluation framework for
lexical disambiguation tasks.

Keywords: Lexical disambiguation, diacritics restoration, decision tree,
boosting, word segmentation feature space.


1  Introduction 

The aim of diacritics restoration is to restore the original script from
diacritic-free script by correctly inserting diacritics. The subjects of
diacritics restoration are languages containing diacritics, such as French,
Spanish, Dutch, Vietnamese, etc. In natural language processing, diacritics
restoration is a particular lexical disambiguation task. For example, “xu ly
ngon ngu tu nhien”, a diacritic-free script in Vietnamese meaning “natural
language processing”, will be restored as “xử lý ngôn ngữ tự nhiên” after
inserting the correct diacritics.

Though diacritics restoration has potential commercial applications (typing
assistants, search query resolution, etc.), not many researchers have studied
this topic. A well-known work in the literature is accent restoration in
Spanish and French [1]. Considering accent restoration as multinomial
classification, Yarowsky achieved an accuracy of 99.7% on the full task using
decision list learning. In a comparative work, Mihalcea compared learning from
letters and learning from words for diacritics restoration in Romanian [2]. The
conclusion of that paper is that learning from letters, which is comparable
with learning from words, could be applied to resource-scarce languages. In an
attempt to use a dictionary in combination with learning from words, accuracy
over 90% was achieved in [6]. It is worth noting that training data and test
data for supervised learning of diacritics restoration are easily collected
from the Internet. This is in contrast to other lexical disambiguation tasks,
where annotation of training data and test data is the most time-consuming
phase.
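The reason labeled data is cheap here is that any correctly written Vietnamese text yields a training pair once its diacritics are removed. A minimal sketch of that preprocessing step, using only the Python standard library (the function name is hypothetical):

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, producing the diacritic-free input script."""
    # The letter đ/Đ has no canonical decomposition, so it is mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (tone marks, circumflex, breve, horn).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


original = "xử lý ngôn ngữ tự nhiên"
print(strip_diacritics(original), "->", original)
```

Each `(strip_diacritics(s), s)` pair gives the classifier input and its gold label without any manual annotation.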

To our knowledge, the grapheme-based and word-based methods in [5] and the
constrained sequence classification method in [7] are the only studies on
diacritics restoration in Vietnamese. In [5], the accuracy of memory-based
classifiers reaches 63.1% and 82.7% with learning from words and learning from
graphemes, respectively. In that paper, lexical diffusion, which is the ratio
of tokens containing diacritics to all tokens, is claimed to measure the
difficulty of the diacritics restoration task in a specific language. According
to this measurement, Vietnamese ranks second (after Yoruba) among the 14
studied languages.

In this paper, experiments on diacritics restoration in Vietnamese are
conducted using five strategies: learning from letters, learning from
semi-syllables, learning from syllables, learning from words, and learning from
bi-grams. To focus on comparing the proposed approaches, we use only C4.5 as
the classifier. On the other hand, to focus on performance, AdaBoost and C4.5
are combined to obtain high accuracy.

The paper is organized into four parts. The first part briefly introduces
current research related to diacritics restoration. The second part describes
in more detail the linguistic characteristics of Vietnamese to identify the
difficulties of diacritics restoration in Vietnamese in comparison with other
languages. The next part describes the feature sets of the five proposed
approaches. In the last part, experimental results are shown and discussed to
point out the advantages and disadvantages of the proposed approaches. The
paper ends with conclusions and future work.


2  Fundamental  Lexical  Units  in  Vietnamese 

The Vietnamese alphabet contains 29 letters, including 12 vowels and 17
consonants. English letters like [f, j, w, z] are not included, whereas [ă, â],
[ê], [ô, ơ], [ư] and [đ] are variants of [a], [e], [o], [u], [d], in that
order. In both written and spoken language, tone marks are added to vowels to
indicate different tones.

Diacritics restoration in Vietnamese must resolve two kinds of ambiguity:
phonetic diacritics ambiguity (e.g. between [a], [ă], and [â]) and tonic accents
ambiguity (e.g. between [a], [à], [á], [ả], [ã], and [ạ]). Considering
diacritics restoration as multinomial classification, the combination of
phonetic diacritics and tonic accent ambiguities is one of the reasons that
makes the task in Vietnamese more difficult than in other languages.

Vietnamese is a monosyllabic language. For example, in the phrase “xử lý ngôn
ngữ tự nhiên” (natural language processing), the tokens separated by spaces are
syllables. Raw text in Vietnamese does not contain explicit word boundaries.
Word segmentation is the task of defining these boundaries. For example, given
the above phrase, “xử lý ngôn ngữ tự nhiên”, as input, a word segmentation
system should output “[xử lý] [ngôn ngữ] [tự nhiên]”, where word boundaries are
explicit. The sequence of syllables in each pair of brackets indicates a word.
A word may contain one or more syllables (normally two syllables). Word
segmentation is an important preprocessing phase for raw text before applying
lexical disambiguation systems or information retrieval systems.

Table 1. Ambiguity in diacritics restoration in Vietnamese

Letters  Classes

a        [a, à, á, ả, ã, ạ, ă, ằ, ắ, ẳ, ẵ, ặ, â, ầ, ấ, ẩ, ẫ, ậ]
e        [e, è, é, ẻ, ẽ, ẹ, ê, ề, ế, ể, ễ, ệ]
o        [o, ò, ó, ỏ, õ, ọ, ô, ồ, ố, ổ, ỗ, ộ, ơ, ờ, ớ, ở, ỡ, ợ]
u        [u, ù, ú, ủ, ũ, ụ, ư, ừ, ứ, ử, ữ, ự]
i        [i, ì, í, ỉ, ĩ, ị]
y        [y, ỳ, ý, ỷ, ỹ, ỵ]
d        [d, đ]

The set of syllables in Vietnamese is finite. A syllable is the combination of
a head consonant and a semi-syllable. A syllable by itself does not carry
meaning; it is just a pronounceable lexical unit. The pronunciation of a
syllable is determined by two components: the head consonant, which is
optional, and the semi-syllable. Syllables with the same semi-syllable have
different pronunciations depending on the head consonant, and vice versa. For
example, in the phrase “xử lý ngôn ngữ tự nhiên”:


Table 2. Head consonant and semi-syllable as components of syllable


Syllable   Head consonant   Semi-syllable

xử         x                ử
lý         l                ý
ngôn       ng               ôn
ngữ        ng               ữ
tự         t                ự
nhiên      nh               iên

There are 28 head consonants and 748 semi-syllables in Vietnamese. The
combination of head consonants and semi-syllables creates 21K syllables. In a
normal Vietnamese dictionary, 7K syllables are used to create 40K words. A word
normally contains two syllables. As a result, the feature space in learning
from semi-syllables, learning from syllables, and learning from words increases
remarkably, in that order. This observation is important in supervised learning
for lexical disambiguation.
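Splitting a syllable into its optional head consonant and its semi-syllable can be sketched with a longest-prefix match. The onset list below is an assumption based on common Vietnamese orthography, not the paper's inventory of 28 head consonants, and the function name is hypothetical.

```python
# Assumed onset spellings, tried longest first so that "ngh" wins over "ng" and "n".
HEAD_CONSONANTS = sorted(
    ["ngh", "ch", "gh", "gi", "kh", "ng", "nh", "ph", "qu", "th", "tr",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s", "t", "v", "x"],
    key=len,
    reverse=True,
)


def split_syllable(syllable: str):
    """Return (head_consonant, semi_syllable); the head consonant may be empty."""
    lower = syllable.lower()
    for head in HEAD_CONSONANTS:
        # Require a non-empty remainder so the semi-syllable always exists.
        if lower.startswith(head) and len(lower) > len(head):
            return syllable[:len(head)], syllable[len(head):]
    return "", syllable


for s in "xử lý ngôn ngữ tự nhiên".split():
    print(s, split_syllable(s))
```

Because many syllables share a semi-syllable, replacing each syllable by its semi-syllable shrinks the number of distinct attribute values.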

2.1  Text  Corpus 

Our text corpus contains 3.7K articles (2.2M tokens) in the education category
of VnExpress.net from May 2007 to August 2008. There are 20K unique tokens in
the corpus. That means, on average, each token appears about 4 times in all the
documents. 4.5K syllables, which are used in the Vietnamese dictionary,
frequently appear in the corpus as tokens. The remaining 15.5K tokens do not
belong to the Vietnamese dictionary, and each of them rarely appears in the
corpus. They are mainly English named entities, like celebrities' names, movie
titles, song titles, country names, locations, terminologies, etc., none of
which contain diacritics. Some tokens containing diacritics are acronyms, noise
or misspelled text. To eliminate the effect of noisy data and to reduce the
feature space in decision tree learning, all tokens not belonging to the
Vietnamese dictionary (out-of-vocabulary tokens) are tagged with the same label
“UNKNOWN”.

3  Feature  Set 

In our work, the surrounding context of the ambiguous pattern is selected as
features. A sliding window is scanned through the training corpus to build data
instances. Following popular experiments in the literature on lexical
disambiguation, we chose a window of size 5 to the left and to the right of the
ambiguous pattern. The ambiguous pattern is centered in the sliding window. No
feature selection or parameter tuning is applied. The default parameters of
C4.5 as implemented in Weka are used.
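For the letter-based strategy, the sliding window can be sketched as follows. The function names and the padding symbol are hypothetical; each instance holds the five diacritic-free characters to the left, the ambiguous letter, the five characters to the right, and the original letter as the class label. Delimiter and number tagging is left out to keep the sketch short.

```python
import unicodedata

AMBIGUOUS = set("aeiouyd")  # base letters that can carry diacritics


def base_letter(ch: str) -> str:
    """Map one precomposed character to its diacritic-free base letter."""
    if ch == "đ":
        return "d"
    if ch == "Đ":
        return "D"
    return unicodedata.normalize("NFD", ch)[0]


def letter_instances(text: str, window: int = 5, pad: str = "_"):
    """Build (left context, letter, right context, label) tuples from original text."""
    text = unicodedata.normalize("NFC", text)  # one code point per letter
    plain = "".join(base_letter(ch) for ch in text)
    padded = pad * window + plain + pad * window
    instances = []
    for i, ch in enumerate(plain):
        if ch.lower() in AMBIGUOUS:
            left = padded[i:i + window]
            right = padded[i + window + 1:i + 2 * window + 1]
            instances.append((left, ch, right, text[i]))
    return instances


for instance in letter_instances("xử lý"):
    print(instance)
```

In a decision tree learner each of the ten context positions would become one nominal attribute.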


Table 3. Statistics of the number of syllables in words in a Vietnamese
dictionary containing 30K entries


#Syllables in a word   #Words   Percentage

1                      5208     17.27
2                      22866    75.81
3                      1362     4.52
4                      653      2.16
≥ 5                    75       0.25

Five  feature  types  are  used: 

1. Learning from letters: Ambiguous patterns are letters that may have different
diacritics (Table 1). Attribute values are case sensitive. Delimiters (space,
comma, dot, question mark, and colon), dates, and numbers are tagged as
SPACE, COMMA, DOT, QUESTION, COLON, DATE, and NUMBER,
respectively.

2. Learning from syllables: Of the 20K unique tokens in the corpus, 15.5K tokens
are out-of-vocabulary tokens, to which no diacritics restoration needs to be
applied. To reduce the feature space of the training data, all
out-of-vocabulary tokens are tagged with the same label “UNKNOWN”. The 4.5K
tokens which are syllables used in the Vietnamese dictionary correspond to
1.3K diacritic-free tokens after removing diacritics.

3. Learning from semi-syllables: Focusing on reduction of the feature space, we
propose an approach based on the construction rules of syllables in Vietnamese,
called learning from semi-syllables. In learning from syllables, each attribute
has 1.3K values. Semi-syllables are extracted by omitting head consonants
from syllables. As a result, each attribute has about 100 values.

4. Learning from words: To prepare data for learning from words, the training
text is preprocessed by a word segmenter. In our work, we use the word
segmenter in [4], which is claimed to produce 90% accuracy.

5. Learning from bi-grams: To clarify the difference between learning from
syllables (unigrams) and learning from words, learning from n-grams is
considered as an “intermediate approach”. In the Vietnamese dictionary, the
majority of words are composed of 2 syllables (Table 3). As a result, bi-gram
based learning was chosen in our work.
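The token tagging mentioned in the list above can be illustrated with a small normalizer. This sketch merges the delimiter, date and number tags of the letter-based strategy with the “UNKNOWN” tag of the syllable-based strategy into one function for brevity; the function name, the date and number formats, and the toy dictionary are assumptions.

```python
import re

DELIMITERS = {" ": "SPACE", ",": "COMMA", ".": "DOT", "?": "QUESTION", ":": "COLON"}
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}(/\d{2,4})?")   # assumed date format, e.g. 12/05/2008
NUMBER_RE = re.compile(r"\d+([.,]\d+)*")              # assumed number format


def normalize_token(token: str, dictionary) -> str:
    """Replace delimiters, dates, numbers and out-of-vocabulary tokens by tags."""
    if token in DELIMITERS:
        return DELIMITERS[token]
    if DATE_RE.fullmatch(token):
        return "DATE"
    if NUMBER_RE.fullmatch(token):
        return "NUMBER"
    if token.lower() not in dictionary:
        return "UNKNOWN"
    return token


dictionary = {"sinh", "vien"}  # hypothetical diacritic-free syllable list
tokens = ["sinh", "vien", "Google", ",", "2008", "12/05/2008"]
print([normalize_token(t, dictionary) for t in tokens])
```

Collapsing all out-of-vocabulary tokens into one value is what keeps the number of attribute values near the 1.3K figure quoted for syllables.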

4  Experimental  Results  and  Future  Works 

Using the training corpus, 2M data instances of all ambiguous patterns are
created for each learning approach. The evaluation follows a 10-fold
cross-validation scheme. The highest accuracy is achieved by combining C4.5 as
the weak learner with AdaBoost as the boosting learner. AdaBoost improves the
accuracy by 1.4% over individual C4.5.

Table 4. Comparison of accuracy in different learning strategies


Learning strategy                Accuracy

Baseline (most frequent class)   45.15
C4.5 + Letters                   93
C4.5 + Semi-syllable             88.2
C4.5 + Word                      91.9
C4.5 + Bi-gram                   88.8
AdaBoost + C4.5 + Letters        94.7


Despite the simplicity of the feature set, learning from letters results in the
best performance. Learning from semi-syllables produces, as expected, the
lowest accuracy. Although the loss of information is obvious when all head
consonants are omitted, a 1.6% penalty against learning from syllables in
exchange for the reduction of the feature space (from 1000 to 100 candidates
for each feature) is an encouraging result.

The discussion about using an n-gram model or using word segmentation as a
preprocessing phase in monosyllabic languages like CJK or Vietnamese is
continuing while high accuracy in word segmentation has not been achieved [8].
In our work, learning from words performs better than learning from syllables
and learning from bi-grams. It is our belief that a more accurate word
segmenter will improve the results of learning from words. Using word-based
diacritics restoration as an application-driven evaluation framework for the
word segmentation task is a potential future work.

The constrained sequence classification method in [7] achieves 94.3% accuracy,
which is in line with our best result. It should be noted that the training
data and test data in our work and in [7] are different. Experiments using the
same training data and test data should be conducted to obtain a reliable
comparison.


1 http://www.loria.fr/~lehong/tools/vnTokenizer.php

5 Conclusions

In this paper, experiments on diacritics restoration are conducted using
different learning strategies. The experimental results reveal that learning
from letters achieves the best result. On the other hand, the performance of the
other strategies is expected to improve with accurate syntactic and semantic
knowledge extracted from raw text. Our proposed strategy, learning from
semi-syllables, produces slightly lower results than the other strategies.
However, the reasonable dimensionality of its feature space and its potential
for improved accuracy show that learning from semi-syllables is not a bad
choice. In the worst case, it could be used as a baseline method.

Abstract. A feedback framework is proposed in this paper to assist Web 2.0
users' tagging. A new measure called Estimated Daily Visit is defined and
proposed as the measure of tag quality. Quantitative and qualitative feedback
methods are also defined with the measure. A prototype has been implemented to
show the validity of the framework, and a preliminary result shows that the
framework can successfully enhance the quality of tags on user-generated
contents.


1  Introduction 

Folksonomies, tag annotations of user-generated contents, have now become a
common standard for Social Web services. Images or videos in top search results
are often well annotated with multiple tags of fine granularity. This can lead
to the false impression that user-generated contents are now well annotated
with tags. However, this is not true. Contents that are annotated inadequately
simply are not exposed in the search results, due to the poor quality of their
annotations. Many contents are annotated with no tags, or with too few or too
general tags that cannot help search engines find the content.

The open nature of tag annotation is a double-edged sword. Creative users can
always add new and useful tags that help describe and differentiate their
content. Yet naive users often annotate their contents with tag words that will
never be used as search terms, or sometimes do not even bother to tag at all.

The goal of this research is to help this second group of users. The paper
proposes an interactive framework that helps non-expert users understand the
quality of their tags at tagging time. The proposed framework can interactively
provide information about the tag words being attached to content:

- How useful are the current tags as search keywords?

- How specific is the tag set? Does the user need to add more terms?

- What other contents are there with a similar/same tag set? And how many?

- Does the tag set make the annotated content distinctive enough?

Fig. 1. Process of tag quality feedback from user's point of view

In this paper, a set of measures for tag quality is first proposed. The measures
are then used in a framework of interactive feedback designed to help Web
service users. We call this framework tag quality feedback. A prototype of this
framework has been implemented to show its validity, and a preliminary result
shows that the framework can successfully enhance the quality of tags on
user-generated contents.

2  Related  Work  and  Basic  Idea  of  the  Framework 

Assisting users at tagging time is not a new idea. Tag recommendation is one
such scheme [1][2]. There are also commercial tools that help find tags, useful
links and related pictures for user-generated content [3][4]. The framework
proposed in this paper has a very different focus from previous work on tag
recommendation: it compares the tags being attached to the content with tags
previously attached to existing contents.

The idea of assigning a “quality value” to annotated tags appears in previous
work such as [5]. However, previous quality values for tags are derived from the
reliability of authors or the redundancy of tags annotated on the same contents.

Figure 1 shows the tag quality feedback process from the user's point of view.
A user first adds a content and its initial tags. The tags are then analyzed by
the framework in terms of search and retrieval. Quantitative and qualitative
feedback is then given to the user, and the user refines her tags. Then the
cycle starts again. It stops when the user is satisfied with the estimated
quality. The framework helps users decide how many tags are enough, and see
where the attached tag set will put the content among related contents.

3  Tag  Quality  Measures 

3.1  Estimated  Daily  Visit 

A good annotation should not only correctly reflect the content (the
content-to-annotation relation), but should also perform well as an index that
makes the content distinctive (the annotation-to-annotation and
annotation-to-query relations). As an index, the role of annotation is to help
other users locate the content. Estimated daily visit count (EDV) is proposed as
a measure of tags in this role. Let Q be a set of queries where each member q_i
is a query (with one or more terms) whose search result includes the content
currently being annotated. Then EDV can be formulated as follows:

    EDV = tdv \cdot \sum_{q_i \in Q} P_q(q_i) \, P_s(q_i)

In the equation, tdv is the total daily visit count for the whole service,
Pq(q_i) is the probability of the query q_i being presented as a query, and
Ps(q_i) is the probability of the content in focus being visited in the search
result of query q_i.

For example, if a picture is tagged with “Eiffel Tower” and “Paris”, three
queries can reveal the content in their search results: Q = { Eiffel Tower,
Paris, Eiffel Tower AND Paris }. The EDV for this content is determined by the
sum of three terms: Pq(Eiffel Tower) Ps(Eiffel Tower) + Pq(Paris) Ps(Paris) +
Pq(Eiffel Tower AND Paris) Ps(Eiffel Tower AND Paris).

In the EDV equation, the summed probability is then multiplied by the total
daily visit count. As a result, the EDV value shows “how often your content will
be visited by users via a search engine, with the current set of tag words”. The
actual calculation of Pq and Ps depends on the implementation.

In general, EDV prefers tag sets with the following properties:

- Larger tag sets over smaller tag sets: a tag set with more tags has more ways
(queries) to access the content. In the equation, set Q becomes larger with
more tags.

- Commonly used terms over unknown/rare words: the Pq part approaches zero if
the term is not often used in queries.

- Distinctive combinations over common combinations: this is due to Ps.
Distinctive combinations of common query words will yield a higher EDV
value.
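
The EDV definition can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: Q is taken to be every non-empty AND-combination of the tags (which matches the two-tag example but is not spelled out in general), and `toy_p_q` and `toy_p_s` are invented placeholder probabilities, since the real Pq and Ps depend on the implementation.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Query = FrozenSet[str]  # a query is modelled as the set of terms joined by AND


def candidate_queries(tags: Iterable[str]) -> List[Query]:
    """Build Q as every non-empty AND-combination of the tags (an assumption)."""
    unique = sorted(set(tags))
    return [
        frozenset(combo)
        for size in range(1, len(unique) + 1)
        for combo in combinations(unique, size)
    ]


def estimated_daily_visit(
    tags: Iterable[str],
    tdv: float,
    p_q: Callable[[Query], float],
    p_s: Callable[[Query], float],
) -> float:
    """EDV = tdv * sum over q in Q of Pq(q) * Ps(q)."""
    return tdv * sum(p_q(q) * p_s(q) for q in candidate_queries(tags))


# Toy probabilities, invented purely for illustration.
def toy_p_q(query: Query) -> float:
    return 0.01 ** len(query)  # longer AND-queries are submitted less often


def toy_p_s(query: Query) -> float:
    return 0.5


edv = estimated_daily_visit(
    ["Eiffel Tower", "Paris"], tdv=1000, p_q=toy_p_q, p_s=toy_p_s
)
```

With two tags the set Q has three members; a third tag raises this to seven, which is why larger tag sets tend to score higher.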

3.2 Associated Measures for Interactive Comments

To help users' understanding, three sub-measures are proposed in this paper.
These are flag values (boolean values) that can be fed back to the user to
signal possible problems with the current annotation.

- Too few tags: if the size of set Q is smaller than a given threshold, or the
average of the average Pq and the average Ps is smaller than a given threshold,
this flag value will be set. This flag value can only be set for a tag set, not
for each individual tag.

- Terms too rare/unknown: if the average Pq value is lower than a given
threshold, this flag value can be set. This value can be set for each tag and
for a tag set.

- Too indistinctive combinations of tags: if the average Ps value is smaller
than a given threshold, this flag value will be set. This value can only be set
for a tag set.
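
The three flags can be sketched for a whole tag set as follows. The threshold parameter names are hypothetical, and the "too few tags" condition is read here as the mean of the two averages falling below a threshold, which is one interpretation of the wording rather than a confirmed definition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class TagSetFlags:
    too_few_tags: bool
    terms_too_rare: bool
    too_indistinctive: bool


def tag_set_flags(
    p_q_values: Sequence[float],  # Pq(q) for every query q in Q
    p_s_values: Sequence[float],  # Ps(q) for every query q in Q
    min_queries: int,             # hypothetical threshold on |Q|
    min_combined_avg: float,      # hypothetical threshold on mean(avg Pq, avg Ps)
    min_avg_p_q: float,           # hypothetical threshold on avg Pq
    min_avg_p_s: float,           # hypothetical threshold on avg Ps
) -> TagSetFlags:
    """Compute the three boolean flags for one tag set."""
    avg_p_q = mean(p_q_values) if p_q_values else 0.0
    avg_p_s = mean(p_s_values) if p_s_values else 0.0
    combined = (avg_p_q + avg_p_s) / 2
    return TagSetFlags(
        too_few_tags=len(p_q_values) < min_queries or combined < min_combined_avg,
        terms_too_rare=avg_p_q < min_avg_p_q,
        too_indistinctive=avg_p_s < min_avg_p_s,
    )


flags = tag_set_flags(
    p_q_values=[0.5, 0.5, 0.25],
    p_s_values=[0.1, 0.1, 0.4],
    min_queries=3,
    min_combined_avg=0.1,
    min_avg_p_q=0.5,
    min_avg_p_s=0.1,
)
```

Per-tag flags for rare terms would apply the same Pq test to each single-term query.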

Flag  values  will  be  shown  to  users  in  UI  as  comments  for  tags. 

4  Framework  for  Quantitative  and  Qualitative  Tag 
Quality  Feedback 

4.1  Qualitative  Feedback 

Quantitative  measures  are  often  not  the  best  method  for  human  users  to  see  the 
“position”  of  their  annotation.  To  help  users  to  visualize  the  effect  of  their  tags, 
the  proposed  framework  additionally  has  two  qualitative  feedback  methods. 

Listing of similar contents. Especially for tag annotations of images or moving
pictures, this is an effective way for users to understand where the annotated
tags will put their content among other contents. By comparing the tag set
attached to the content in focus with other tag sets, it is possible to show
some contents that are tagged with similar tags. By showing the top n similar
contents, this method lets the user know what other contents are similar in
terms of annotated tags.

Listing of likely queries. The content can be reached through more than one
query, so this method shows the likely paths that will lead other users to the
content.

4.2  Architecture  of  Tag  Quality  Feedback  Framework 

With  EDV  and  two  qualitative  feedback  methods,  it  is  possible  to  draw  the 
architecture  of  tag  quality  feedback  framework. 

Figure 2 shows an overview of the framework. The three major modules of the
framework lie on the right side of the figure: the Pq, Ps and qualitative
feedback modules. They obtain the data needed to calculate feedback values from
query log data and the search engine on the right. From these major modules,
three quantitative values and two qualitative feedbacks are generated. The
generated values are then passed to the UI for each cycle of tag quality
feedback.

Fig. 2. Overview of tag quality feedback framework


4.3  Prototype  Implementation  and  Preliminary  Result 

A prototype system has been implemented to test the feasibility of the tag
quality feedback framework. The prototype is implemented as a local program
designed to improve the quality of tags annotated on pictures. The program
assumes that a new picture is being uploaded to Flickr. The search results and
statistics for tags are obtained through the Flickr APIs.

Several model probabilities and constant values must be set before
implementation. The tdv value is a constant that represents the total number of
visits on the contents of the service. In the prototype, it was set to 100
million. Pq(q_i) is the probability of the query q_i being submitted as a query.
Modeling the probability of query q_i is an interesting issue. In our prototype,
it was not possible to access the query logs of the target service, so Pq was
replaced by the probability of a term appearing as a tag. That is, for q_i with
a single term, Pq(q_i) = number of times the term was observed / number of all
tags observed. For q_i with more than one term, it was defined as Pq(q_h AND
q_j) = Pq(q_h) Pq(q_j). Ps(q_i) is the probability of the content in focus being
visited among the search results of query q_i. In the prototype, a simple
IDF-like value was used. That is, Ps(q_i) = c / number of contents in the search
result of q_i. Constant c is the average number of visits per search result. In
the prototype, an optimistic value of 40 was used for c.
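
The two prototype estimates can be written down directly. The following is a sketch only: the function names and the `tag_counts` mapping are hypothetical stand-ins for statistics that the prototype obtained from the target service.

```python
from math import prod
from typing import Iterable, Mapping


def p_q_from_tag_counts(query_terms: Iterable[str], tag_counts: Mapping[str, int]) -> float:
    """Pq for an AND-query: product of per-term tag frequencies.

    Each term contributes (times the term was observed as a tag) / (all tags observed).
    """
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    return prod(tag_counts.get(term, 0) / total for term in query_terms)


def p_s_idf_like(result_count: int, c: float = 40.0) -> float:
    """Ps = c / (number of contents in the search result).

    Note that this can exceed 1 when the result set has fewer than c items.
    """
    if result_count <= 0:
        return 0.0
    return c / result_count


# Hypothetical tag statistics for illustration only.
tag_counts = {"paris": 50, "eiffel": 10, "cat": 40}
pq_single = p_q_from_tag_counts(["paris"], tag_counts)
pq_pair = p_q_from_tag_counts(["paris", "eiffel"], tag_counts)
ps = p_s_idf_like(800)
```

A term never seen as a tag gives Pq = 0, so any query containing it contributes nothing to EDV.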

The qualitative feedback was prepared similarly. To get similar contents, the
members of set Q are queried against the target service sequentially, from the
longest q_i to the shortest. The first 20 pictures obtained by this method are
shown to the user as similar contents. The listing of likely queries is obtained
by taking the top q_i with the highest Pq(q_i)Ps(q_i). In the prototype, the
three most likely queries and the actual number of contents resulting from each
query are given back to users.
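
Both qualitative feedback steps can be sketched as below. The `search` callable is a hypothetical stand-in for the target service's search API, and the tie-breaking order among queries of equal length is an assumption made to keep the example deterministic.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Query = FrozenSet[str]


def similar_contents(
    queries: Iterable[Query],
    search: Callable[[Query], List[str]],  # hypothetical search function
    limit: int = 20,
) -> List[str]:
    """Query from the longest query to the shortest and keep the first `limit` hits."""
    seen, results = set(), []
    for query in sorted(queries, key=lambda q: (-len(q), sorted(q))):
        for item in search(query):
            if item not in seen:
                seen.add(item)
                results.append(item)
                if len(results) == limit:
                    return results
    return results


def likely_queries(scores: Dict[str, Tuple[float, float]], top_n: int = 3) -> List[str]:
    """Rank queries by Pq * Ps and return the top_n query strings."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    return [query for query, _ in ranked[:top_n]]


# Toy index and scores, invented for illustration.
index = {
    frozenset({"paris", "eiffel"}): ["img1", "img2"],
    frozenset({"paris"}): ["img2", "img3", "img4"],
    frozenset({"eiffel"}): ["img1", "img5"],
}
similar = similar_contents(index.keys(), lambda q: index.get(q, []), limit=4)
top = likely_queries(
    {
        "paris": (0.5, 0.01),
        "eiffel": (0.1, 0.2),
        "paris AND eiffel": (0.05, 0.5),
        "rare": (0.001, 0.1),
    }
)
```

Starting with the longest query means the most specific matches are shown first.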

With this setup, a small preliminary experiment was done with 100 selected
Flickr images. The images were selected from a larger set of images that have
only one tag, with a minimum EDV value of 1.0. Two test users were asked to use
the prototype system to interact and add tag annotations to each picture. The
testers were instructed to interact with the feedback output for at least two
cycles. The refined tags after the feedback achieved much higher EDV values:
average EDV values of 42.1 and 57.3. The average numbers of tags were 4.3 and
7.2.

This preliminary experiment shows that the proposed tag quality feedback
framework can enhance the quality of tags annotated on user-generated contents.
However, it cannot replace a real accessibility test of the system. For example,
it has not yet been shown how normal users would accept the response/feedback,
or how this feedback would change the typical behavior of naive users. Also, it
is yet to be shown that the EDV value and the actual number of content visits
are positively correlated. Evaluating various aspects of tags, tag quality
feedback and Web 2.0 users is important future work for the tag quality
feedback framework.

5  Conclusions 

A framework for tag quality feedback is proposed in this paper. A measure called
“estimated daily visit” is first derived to reflect the likelihood that
annotated tags reveal the content in keyword search. Three associated measures
and two qualitative feedback methods are also devised to help naive users edit
their tags to get a better score. Much future work remains for the framework.
The assumptions of EDV are based on search processes, and to prove those
assumptions right, tag sets optimized for EDV should actually attract a
significantly higher number of content visitors. This would need a long-term
evaluation in a real-world environment, and it would also need controlled
contents with original tag annotations and EDV-optimized tags.

Abstract. With the recent proliferation of mobile devices, it has become
possible to collect users' life-logs. Human long-term memory is an
interconnected network, and its retrieval system is cue-dependent. Semantic
networks are used to implement this human retrieval system. Relevant data can be
retrieved more effectively with a search system based on network visualization,
which shows relations among data, than with a text-based search system. This
paper proposes a semantic network representation of mobile life-logs based on
activity theory, together with associative search over it based on network
visualization. We have implemented the system, searched data using an example
query, and performed a subjective test. As a result, we have confirmed that this
system is useful for associative retrieval resembling human cue-dependent
recall.

Keywords:  Mobile  Log,  Semantic  Networks,  Associative  Search,  Network 
Visualization. 


1  Introduction 

Recently, because of the widespread use of mobile devices, it has become
possible to collect and manage various kinds of user information through them.
These data, called mobile life-logs, include a user's calls, SMS (short message
service), photography, music-playing and GPS (global positioning system)
information. Since the amount of these data increases exponentially, it is
important to be able to retrieve the data needed.

Semantic networks have merits for storing mobile life-logs. A mobile life-log is
one of the auxiliary memory units for a person. Information is stored as an
interconnected network in human long-term memory. Associative search refers to
the cue-dependent retrieval system of human interconnected memory [1]. A
representation of mobile life-logs should support associative search like the
human retrieval system, and semantic networks are more suitable for this than
relational database systems. In this paper, to make an effective representation
of mobile life-logs that expresses the user's context, the context model of
activity theory is adopted. According to this theory, a user's context should be
formed by activity [2].

Associative search systems are effective for finding relevant data. Previous
studies have mainly presented text-based associative search systems. These do
not fully utilize the strength of semantic networks because they do not show
relations between data. This paper proposes a semantic network representation of
mobile life-logs based on activity theory for a visualization-based associative
search modeled on human memory.


2  Activity  Theory 

Activity theory is a powerful and clarifying descriptive tool rather than a
strongly predictive theory. The object of activity theory is to understand the
unity of consciousness and activity. Activity theory incorporates strong notions
of intentionality, history, mediation, collaboration and development in
constructing consciousness [2]. The context model of activity theory assumes a
subjective view of situations. This is in contrast to the prevailing view, where
context normally describes an objectively defined situation. Any experience is
personal [3]. In this paper, a semantic network representation of mobile
life-logs is designed based on the context model of activity theory.


Table 1. Elements of Context model

Type of context            Meaning
Environmental context      Users' surroundings accessed by the user.
Personal context           The mental and physical information about the user.
Social context             The social aspects of the user like roles.
Task context               The user's goals, tasks and activities.
Spatio-temporal context    Time, location and the community present.

3  Proposed  Method 

After collecting log data (GPS, call, SMS, picture viewer, photo, MP3, charging,
and action logs), the system generates a semantic network from the mobile
life-logs following the defined representation, and visualizes the resulting
semantic network. Next, a user searches data using selection and keyword
associative search on the visualized semantic network. The system provides a
visualized result graph consisting of the related data and the relationships
among them. Semantic abstraction helps a user understand the retrieved
information. Figure 1 shows the entire system for the mobile life-log semantic
network in this paper.

3.1 Design for a Representation of Semantic Networks of Mobile Life-Log

The context model of activity theory is used as a reference to define a semantic
network representation of mobile life-logs. In this representation, the ‘user’
node, which expresses the user's profile, is the root node. The next type of
node, linked to the root node, is the ‘category’ of actions. An ‘action’ node
follows a category node. Since ‘place’ and ‘date and time’ are important factors
for inferring an action, they are connected to an action node. The last type
relates to the ‘functions’ of a mobile device. Table 3 shows the definition of
types of nodes. Table 2 shows how the context model is adapted to this
representation.

Table 2. Mapping node types to elements of context model of activity theory

Context of activity theory    Node Type
Activity                      Category, Action
Personal context              User
Task context                  Function (playing music, taking a picture)
Spatio-temporal context       Place, Date and Time
Environmental context         Content
Social context                Function (Call, SMS)


Fig. 1. The associative search system for a mobile life-log semantic network. A
user can retrieve data by selection associative search on the left side and by
keyword associative search using the text box and button at the upper right. The
ontology graph is shown in the center of the right side.

3.2  Associative  Search 

Text-based associative search systems are limited in their ability to search
data associatively, since it is hard for them to express the relationships
between data. Therefore, methods of associative search based on network
visualization are needed. Selection search is a way to find related data by
selecting a node on the visualized semantic network. When a node is chosen, its
directly related data are shown to the user. The user can then click one of the
retrieved nodes. In this way a user finds data by selecting nodes step by step.
This process resembles human memory retrieval.

In addition, this system provides keyword associative search. If the system
offered only selection search, retrieving information would take considerable
time. The pseudo-code for keyword associative search is given in Figure 2. Its
input is a keyword as a query, and its output is a result graph consisting of
the related precedent nodes, descendant nodes, and their relationships.

Input: string keyword, Graph S

function Graph KeywordAssociativeSearch
    Graph ResultGraph
    // locate the node matching the keyword
    Node KeywordNode = DFS(keyword, S).Result();
    // add the routes leading to the keyword node (precedent nodes)
    ResultGraph.add(DFS(keyword, S).Routes());
    // add the keyword node itself
    ResultGraph.add(DFS(keyword, S).Result());
    // add the routes reachable from the keyword node (descendant nodes)
    ResultGraph.add(Traverse(KeywordNode, S).Routes());
    return ResultGraph;
end


Fig.  2.  The  pseudo-code  for  keyword  associative  search 
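
The pseudo-code in Fig. 2 can be made concrete with a small Python sketch. Several details here are assumptions made for illustration: the network is a directed adjacency mapping rooted at a `user` node, the keyword matches a node label exactly, and the result graph is represented as a set of edges. The paper's `DFS` and `Traverse` helpers are replaced by the hypothetical functions `find_route` and `descendant_edges`.

```python
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, List[str]]  # node label -> labels of child nodes
Edge = Tuple[str, str]


def find_route(graph: Graph, root: str, keyword: str) -> Optional[List[str]]:
    """Depth-first search from root; return the node path ending at keyword."""
    stack = [(root, [root])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == keyword:
            return path
        if node in visited:
            continue
        visited.add(node)
        for child in reversed(graph.get(node, [])):
            stack.append((child, path + [child]))
    return None


def descendant_edges(graph: Graph, start: str) -> Set[Edge]:
    """Collect every edge reachable from start."""
    edges, stack, visited = set(), [start], {start}
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            edges.add((node, child))
            if child not in visited:
                visited.add(child)
                stack.append(child)
    return edges


def keyword_associative_search(graph: Graph, root: str, keyword: str) -> Set[Edge]:
    """Result graph = route from root to the keyword node + its descendants."""
    route = find_route(graph, root, keyword)
    if route is None:
        return set()
    return set(zip(route, route[1:])) | descendant_edges(graph, keyword)


# Toy life-log network, invented for illustration.
network = {
    "user": ["leisure", "work"],
    "leisure": ["watching a movie"],
    "watching a movie": ["SMS", "cinema"],
    "SMS": ["msg1", "msg2"],
    "work": ["meeting"],
}
result = keyword_associative_search(network, "user", "SMS")
```

The result keeps the context that leads to the keyword node (category and action) as well as the data below it, and omits unrelated branches such as `work`.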


3.3  Semantic  Abstract  for  Semantic  Networks 


Semantic abstraction can present a network more effectively. Semantic
abstraction, introduced by Shen et al. (2006) [4], is adapted to semantic
networks in this paper. An ontology graph is a graph whose nodes represent the
types of nodes in a network and whose edges represent the types of relationships
between them. Semantic abstraction can simplify a network without removing nodes
of the types the user wants to find. A simplified network is called an induced
graph. An induced graph is constructed from the user's selections in the
ontology graph. Types of nodes are called type nodes, and types of edges are
called type edges. An instance of an ontology graph is shown in Figure 2.

The size of each element of the ontology graph is defined by the following
quantities. For a type node of type i, the radius is bounded by C, a constant
giving the maximum size of a type node. NC_i is the number of nodes of type i,
and NC_max is the maximum number of nodes of any one type. E_i is the width of a
type edge of type i, where C represents the maximum size of a type edge. EC_i is
the number of edges of type i, and EC_max is the largest number of edges of any
one type.
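
The text defines these quantities, but the exact scaling rule is not stated here. The sketch below therefore assumes simple linear scaling, so that the most frequent type receives the maximum size C. This is an assumption for illustration, not the authors' confirmed formula, and the counts and the constant are hypothetical.

```python
def scaled_size(count: int, max_count: int, max_size: float) -> float:
    """Scale a size linearly with count (assumed rule): size = C * count / max_count."""
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    return max_size * count / max_count


# Hypothetical node counts per type (NC_i) and maximum radius (C).
node_counts = {"action": 30, "place": 12, "content": 60}
MAX_RADIUS = 40.0

nc_max = max(node_counts.values())
radii = {
    node_type: scaled_size(count, nc_max, MAX_RADIUS)
    for node_type, count in node_counts.items()
}
```

The same function would apply to edge widths, with EC_i and EC_max in place of the node counts.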


4  Experiments  and  Evaluation 

We use mobile log data collected from a college student over 3 days. The
constructed mobile life-log semantic network contains 109 nodes and 106 edges.
The NodeXL library is used for graph visualization
(http://www.codeplex.com/NodeXL).

The given query is “What is the message, the SMS, during watching a movie”. This
means that the user does not know any information except ‘SMS’ and ‘watching a
movie’. Figure 3 shows the process of traversing a mobile life-log semantic
network using associative search. Selection associative search is shown in
Figure 3(a) and (b). Although little information is given, a user can find data
by recalling relevant data step by step. By using keyword associative search,
she can see the background of each message (Figure 3(c)). If the result graph is
very complex, it can be simplified by semantic abstraction (Figure 3(d)).


Semantic  Networks  of  Mobile  Life-Log  for  Associative  Search 




To validate the usefulness of the proposed method, we performed a subjective
test of the implemented application with ten users, based on the System
Usability Scale (SUS) questionnaire. The SUS is a simple, ten-item scale giving
a global view of subjective assessments of usability, with scores ranging from
zero to one hundred [5]. Figure 4(a) shows the SUS test results.
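
For reference, the standard SUS scoring rule can be sketched as follows: each of the ten items is answered on a 1 to 5 scale, odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. The function name is illustrative, and the paper does not show its own scoring code.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Compute a System Usability Scale score (0 to 100) from ten 1..5 responses."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:   # items 1, 3, 5, 7, 9 (positively worded)
            total += response - 1
        else:                # items 2, 4, 6, 8, 10 (negatively worded)
            total += 5 - response
    return total * 2.5


best = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])
neutral = sus_score([3] * 10)
```

The per-user scores would then be averaged over the ten participants.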

Fig. 3. An example of associative search. (a) The initial state of selection
associative search. (b) The second state after selecting the ‘leisure’ category
and the ‘watching a movie’ action in selection associative search. (c) The
result graph for the ‘SMS’ keyword. (d) The induced graph extracted from the
mobile life-log semantic network, excluding the ‘Contents’ type, ‘Phone number’
type, and ‘function’ type.




Fig. 4. (a) SUS scores for the proposed system. (b) Scores for the usability
test to evaluate the search system

To compare the proposed associative search system with the previous text-based
semantic network method, the three questions in Table 3 were added to the
usability test. The result of this test is shown in Figure 4(b). These results
indicate that associative search based on visualization provides an effective
way to retrieve data.

Table 3. Questionnaire of the usability test to evaluate search system

No.  Question                                                 Strongly disagree ... Strongly agree
1    I think the search features provided are useful          1    2    3    4    5
2    I think the way search results are provided is effective 1    2    3    4    5
3    I am satisfied with the search results                   1    2    3    4    5


5  Conclusion 

In this paper, we presented a design for semantic networks of mobile life-logs
for associative search. Mobile life-logs can support human memory. For
human-like retrieval, we stored life-logs in semantic networks. The semantic
network representation of mobile life-logs is based on activity theory. In
addition, we presented associative search for efficient retrieval based on
visualization. It offers selection and keyword associative search, whose result
is shown as a visualized result graph that presents the relationships between
data. To aid users' understanding, this graph can be simplified by semantic
abstraction. We showed that the proposed method makes it easier to find related
data and helps users' understanding.

Abstract. In standard fighting videogames, computer-controlled opponents are
stuck in a rut, so the user learns their behaviors after long play and gets
bored. We therefore propose an adapting opponent with a three-subagent
architecture that adapts to the level of the user by reinforcement learning.
The opponent was evaluated by human users, who compared it against static
opponents.


1  Introduction 

Fighting videogames are a popular genre of videogames. A fighting videogame
is a simulation of hand-to-hand combat and is designed to be played by at least
two users competitively. However, these games can also be played by only one
user. In this case, the machine takes control of the opponent. If given the
option, however, users may prefer to play against other users.

We assume that one of the main reasons users prefer to play against other
users is that the AI found in standard videogames is uninteresting. Typically it
is of a simple design [1], e.g., Finite State Machines [3], which means that the
AI in standard videogames is not complex enough to learn users' patterns.

Nevertheless, learning the user's behavior and adapting to it in order to defeat
the user should not be the aim. An opponent that did so would learn to defeat
the user easily, which is not interesting. Therefore, our aim is to adapt to
the level of the user. Here lies the novelty of our research.

2  Fighting  Videogame 

In typical fighting videogames, the first player to lower the health points (HP)
of the opponent to zero is the winner of a round. The winner of a fight is
decided by the best of several rounds. Some videogames have a time limit per
round.

The set of available actions in a game is X ∪ D ∪ C ∪ B ∪ M. X is the
set of simple attacks. Simple attacks deal moderate damage. D is the set of
defensive actions, or blocks. Blocks guard the character from simple attacks.
C is the set of combos. Combos are predefined combinations of simple attacks
that deal significant damage. B is the set of combo-breakers. Combo-breakers
are special combinations that counter-attack combos. Each combo c_i ∈ C might
have a different combo-breaker b_i ∈ B. When the corresponding combo-breaker is
executed before the attacking player finishes delivering the combo, the
receiving player will not receive the extra damage of the combo. M is the set of
movements the players use to navigate the character. Different fighting
videogames vary in the details and design of the possible actions.

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 649-654, 2010.


3  Proposal 

We  propose  an  agent  that  learns  to  adapt  to  the  user  in  a  fighting  videogame. 
This  agent  controls  a  character  as  the  opponent  of  the  user.  We  divided  the  agent 
into  three  subagents,  each  of  which  is  in  charge  of  handling  some  types  of  actions 
of fighting videogames. They are Main Subagent (MSA), Executing-Combo Subagent
(ECSA) and Receiving-Combo Subagent (RCSA). Since videogames must
run  in  real-time,  all  the  learning  is  delayed  until  the  end  of  each  round.  From 
the  agent’s  point  of  view,  one  round  equals  one  episode. 


3.1  Main  Subagent  (MSA) 

MSA  is  in  charge  of  executing  simple  attacks,  blocks,  and  moving.  When  deemed 
appropriate,  MSA  passes  the  control  to  one  of  the  other  subagents.  MSA  is 
modeled  as  a  Profit-Sharing  agent  [2]. 

At the $t$-th turn of episode $n$, MSA first recognizes the environment as a
state $s_t$ and looks up the recorded weights $w_{n-1}(s_t, a^i)$ of all actions
$a^i$ available in $s_t$. After that, MSA chooses an action $a_t$ with the
probability calculated from the weights using the Boltzmann equation [5], with
temperature $\tau$:

$$P(a^i \mid s_t) = \frac{\exp\left(w_{n-1}(s_t, a^i)/\tau\right)}{\sum_k \exp\left(w_{n-1}(s_t, a^k)/\tau\right)} \qquad (1)$$


The agent records the pair $(s_t, a_t)$ and executes $a_t$. The available actions are
those defined in the videogame in question, plus passing control of the character
to ECSA or RCSA. After the action has been executed, or the subagent executes
its action, MSA resumes control of the character.

At the end of the episode, MSA receives a reward $R_n$ from the environment
and updates the weights of all recorded pairs $(s_t, a_t)$ of this episode by the
following rule. $T$ is the last turn in the episode.

$$w_n(s_t, a_t) := w_{n-1}(s_t, a_t) + R_n \gamma^{T-t} \quad (1 \le t \le T). \qquad (2)$$


MSA receives higher positive rewards when the difference between the final HPs of
the agent and the user is small, whereas negative rewards are given when the
difference is significant. This reinforces actions that lead the agent to behave in
such a way that it is neither too difficult nor too easy for the user. That is, we are
reinforcing actions that put the agent at the same level as the user.
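
The selection and update rules of MSA can be illustrated with a minimal Python sketch. It assumes a dictionary of weights keyed by (state, action) pairs that start at zero; the function names and the state and action labels are hypothetical, and the defaults for the temperature and for gamma follow the values reported in the experiments.

```python
import math
import random
from collections import defaultdict


def boltzmann_probs(weights, tau=1.0):
    """Eq. (1): Boltzmann (softmax) distribution over the available actions."""
    m = max(weights.values())  # subtracting the max avoids overflow in exp
    exps = {a: math.exp((w - m) / tau) for a, w in weights.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def choose_action(w, state, actions, tau=1.0, rng=random):
    """Sample one action for `state` according to Eq. (1)."""
    probs = boltzmann_probs({a: w[(state, a)] for a in actions}, tau)
    return rng.choices(list(probs), weights=list(probs.values()))[0]


def profit_sharing_update(w, episode, reward, gamma=0.99):
    """Eq. (2): credit every recorded (state, action) pair of the episode.

    Pairs recorded closer to the end of the episode (t close to T)
    receive a larger share of the reward.
    """
    T = len(episode)
    for t, (s, a) in enumerate(episode, start=1):
        w[(s, a)] += reward * gamma ** (T - t)


# Hypothetical three-turn episode; the labels are illustrative only.
w = defaultdict(float)
episode = [("far", "approach"), ("near", "punch"), ("near", "block")]
profit_sharing_update(w, episode, reward=1.0)
```

Because the update runs once per episode, the per-turn cost during a round is limited to one weight lookup and one sampling step, which matches the requirement that learning be delayed until the end of each round.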

3.2 Executing-Combo Subagent (ECSA)

ECSA has the responsibility of choosing combos and executing them. In order
for the agent as a whole to be at the same level as the user, the combos the agent
executes must also be on a level close to that of the user.

Since the agent must act in real-time during a round, ECSA randomly selects
a combo from its combo set $C_A \subseteq C$ and executes the selected one when invoked.
Therefore, the problem is how to create $C_A$. If we consider the set of combos
used by the user, $C_U \subseteq C$, the goal of ECSA is to create $C_A$ of similar difficulty.

To create $C_A$, we need metrics to order sets by their difficulty. We use the
following three: ratio of used combos, indistinguishability of combos, and entropy
of combo-breakers. In the following definitions of metrics, $C$ is the set of available
combos for the game in question, and $C' \subseteq C$ is a set of combos.


Ratio of used combos: A better user would execute a wider variety of combos
because it would make it difficult for the opponent to predict the
combo-breakers. Hence, the ratio of used combos is a valid metric:

$$\text{used-ratio}(C') = \frac{|C'|}{|C|} \qquad (3)$$


Indistinguishability of combos: Since the combo-breaker must be executed
before the last action of the combo, a set of combos that are indistinguishable
given the initial actions is more difficult than a set where the combos can be
distinguished by their initial actions. This can be formalized as follows:

$$\text{inds}(C') = \frac{|\{\text{combos with repeated initial actions in } C'\}|}{|C'|} \qquad (4)$$

Entropy of combo-breakers: The set of combos sharing a combo-breaker is
easier than that of combos having different combo-breakers, because a player
playing against the former need not decide which combo-breaker should be
executed. Hence, the entropy of the set of distinct combo-breakers, $B'$ of $C'$,
is a valid metric:

$$\text{entr}(C') = \frac{-\sum_{b \in B'} P(b) \log P(b)}{\log |B'|} \qquad (5)$$

where $P(b)$ is the probability of randomly choosing a combo-breaker $b$ out
of the combo-breakers of $C'$.
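
The three metrics can be sketched in Python as follows. Several details are assumptions made only for illustration: combos are strings of action letters, the combo-breaker of a combo is its last action, "initial actions" means the first `prefix_len` actions, the indistinguishability is normalized by the size of the set, P(b) is the relative frequency of b among the combos of the set, and the entropy is taken as 0 when only one distinct combo-breaker is present (where log |B'| would be zero).

```python
import math
from collections import Counter


def used_ratio(subset, all_combos):
    """Share of the available combos that appear in the subset."""
    return len(subset) / len(all_combos)


def inds(subset, prefix_len=3):
    """Fraction of combos whose initial actions are shared with another combo."""
    if not subset:
        return 0.0
    prefixes = Counter(c[:prefix_len] for c in subset)
    repeated = sum(1 for c in subset if prefixes[c[:prefix_len]] > 1)
    return repeated / len(subset)


def entr(subset, breaker_of=lambda c: c[-1]):
    """Entropy of the combo-breakers, normalized by log of their count."""
    counts = Counter(breaker_of(c) for c in subset)
    if len(counts) < 2:
        return 0.0  # assumed convention for a single combo-breaker
    n = len(subset)
    h = -sum((k / n) * math.log(k / n) for k in counts.values())
    return h / math.log(len(counts))


# Hypothetical game with four combos, three of which are in the subset.
all_combos = ["pppp", "pppk", "kkkk", "kpsp"]
subset = ["pppp", "pppk", "kkkk"]
metrics = (used_ratio(subset, all_combos), inds(subset), entr(subset))
```

In this example two of the three combos share the prefix "ppp", so a player cannot tell them apart before the last action, while the breakers "p" and "k" are unevenly distributed, which lowers the entropy below 1.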


ECSA first creates a combo set containing m combos, whose combo-breakers
and initial actions are different (high combo-breaker entropy, low
indistinguishability). We consider that such an initial set is not too difficult,
but it is not too easy either.

After finishing an episode, this combo set is partially adapted to that of the
user by the algorithm presented in Fig. 1. Δ defines the level of tolerance in
the difference of sizes of the sets, and max_iter limits the number of tries.
Although only the subroutine delete is presented along with the algorithm, the
subroutines add and swap follow the same idea as delete.

adaptECSA():
    if (used-ratio(CA) > used-ratio(CU) + Δ) delete(CA);
    elif (used-ratio(CA) < used-ratio(CU) - Δ) add(CA);
    else swap(CA);

delete(CA):
    for (i := 0; i < max_iter; i++)
        c := a combo in CA chosen randomly;
        if (|entr(CA\{c}) - entr(CU)| < |entr(CA) - entr(CU)|
            or |inds(CA\{c}) - inds(CU)| < |inds(CA) - inds(CU)|)
            CA := CA\{c}; return;
        end if
    endfor


Fig. 1. Pseudo-code of ECSA
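
As a rough illustration of Fig. 1, the delete subroutine and the choice between the three operations might look as follows in Python. This is a sketch under assumptions: combo sets are Python sets, the metrics are passed in as one-argument functions, and the names `delete` and `choose_operation` are hypothetical. The add and swap subroutines are not shown because the text only states that they follow the same idea.

```python
import random


def delete(c_a, c_u, entr, inds, max_iter=20, rng=random):
    """Try to drop one combo so that entr or inds moves closer to the user's.

    Returns the new agent combo set, or the unchanged set when no
    improving deletion is found within max_iter tries.
    """
    for _ in range(max_iter):
        if not c_a:
            break
        c = rng.choice(sorted(c_a))  # sorted() gives a sequence to sample from
        trial = c_a - {c}
        closer_entr = abs(entr(trial) - entr(c_u)) < abs(entr(c_a) - entr(c_u))
        closer_inds = abs(inds(trial) - inds(c_u)) < abs(inds(c_a) - inds(c_u))
        if closer_entr or closer_inds:
            return trial
    return c_a


def choose_operation(ratio_a, ratio_u, delta=0.1):
    """Top-level branch of Fig. 1, given the two used-ratio values."""
    if ratio_a > ratio_u + delta:
        return "delete"
    if ratio_a < ratio_u - delta:
        return "add"
    return "swap"
```

Returning a new set instead of mutating the argument keeps the sketch free of side effects; the pseudo-code itself updates the set in place.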


Fig. 2. Fighting videogame


3.3 Receiving-Combo Subagent (RCSA)

The  design  of  RCSA  is  presented  extensively  in  [4].  This  subagent  basically  mines 
the  patterns  of  the  combos  executed  by  the  user  after  each  episode. 

When invoked during a round, RCSA matches the combos executed by the
user  with  mined  patterns;  using  the  matched  patterns,  RCSA  predicts  the  next 
possible  combos.  Then  RCSA  chooses  the  combo-breaker  stochastically  based 
on  the  relative  frequency  of  the  predicted  combos.  For  more  details,  see  [4]. 


4  Experiments 

We developed a simple fighting videogame using Crystal Space 3D [6] to test the
proposed adapting agent. An image of the videogame is shown in Fig. 2.

The fighting videogame has the following characteristics: the fights occur in
a 2D plane; the characters have a height of 3.5 units and a width of 2 units;
the stage is a finite platform with a length of 42 units, and falling from the platform
equals losing the fight; there is no time limit; there is one round per fight; the
initial HP of the characters is 200; the set X of simple attacks contains punch
(p), kick (k), and special attack (s), the last one being a long-range projectile
attack; each of the simple attacks deals one point of damage; the set D of defenses
contains one action: block; while blocking, simple attacks do not have effect; the
set C of combos is listed in Table 1; for a combo to be valid, each action must be
executed within 0.5 seconds of the previous one; the combo-breaker of a combo


Table 1. Combos

ID   Act    Damage      ID   Act    Damage
0    pppp   15          6    kppk   20
1    pppk   20          7    kpps   25
2    ppps   25          8    kspp   15
3    pkpp   20          9    ksps   20
4    pkpk   15          10   kpsp   30
5    pkps   25          11   kkkk   30


Table 2. Rewards

HP diff   Reward      HP diff   Reward
< 25      +1.00       < 125     -0.25
< 50      +0.75       < 150     -0.50
< 75      +0.25       < 175     -0.75
< 100     -0.10       > 175     -1.00

is defined as its last action; if the combo-breaker is valid, the character executing
the combo receives its damage; the set M of movements contains move to the
right, move to the left, jump and crouch.

The  proposed  agent  was  used  with  the  following  parameters: 

States: To keep the design of the agent simple, we discretized the world state
into the following features, which were selected because they provided enough
information for the agent to make intelligent decisions: (a) crouching or not
(agent/user), (b) jumping or not (agent/user), (c) receiving a combo or not
(agent), (d) executing a simple attack or not (user), (e) blocking or not
(agent/user), (f) at an edge of the platform or not (agent), (g) HP < 30 or not
(user), (h) the distance between the user and the agent, discretized in eight
sections: < 0.25, < 0.50, < 2.00, < 2.60, < 4.00, < 10.00, < 24.50 and > 24.50, (i)
the distance from the agent to the closest special attack thrown by the user,
discretized in three sections: < 0.50, < 2.60 and > 2.60, and (j) the difference
in HP between the user and the agent, rounded to tens.

Actions: The available actions were those available in the game, plus ECSA,
RCSA, and stay. Instead of the actions right and left, the agent used
approach and withdraw. Approach, withdraw, crouch, and block were executed
for 0.1 seconds. Stay had a duration of 0.4 seconds. All the other actions
lasted as long as it took to fully execute them.

Rewards: The reward was defined as in Table 2. The HP difference was the
absolute difference between the HP of the user and the agent.

Others: γ was fixed at 0.99. τ was fixed at 1.0. m = 3. Δ = 0.1. max_iter = 20.
The number of tracked patterns in RCSA [4] was five.
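
The distance discretization and the reward lookup described in these parameters can be sketched with the standard `bisect` module. The thresholds are those listed above and in Table 2; the function names are hypothetical, and assigning a value that falls exactly on a threshold to the next section (including a difference of exactly 175 to the last reward) is an assumption, since the text does not specify the boundary cases.

```python
from bisect import bisect_right

# Upper bounds of the first seven distance sections; the eighth is open-ended.
DISTANCE_EDGES = [0.25, 0.50, 2.00, 2.60, 4.00, 10.00, 24.50]
# Upper bounds of the first two projectile-distance sections.
PROJECTILE_EDGES = [0.50, 2.60]

# Table 2: upper bounds of the HP difference and the matching rewards.
REWARD_EDGES = [25, 50, 75, 100, 125, 150, 175]
REWARDS = [1.00, 0.75, 0.25, -0.10, -0.25, -0.50, -0.75, -1.00]


def distance_bin(d):
    """Index (0-7) of the section containing the user-agent distance d."""
    return bisect_right(DISTANCE_EDGES, d)


def projectile_bin(d):
    """Index (0-2) of the section containing the distance to a special attack."""
    return bisect_right(PROJECTILE_EDGES, d)


def reward(agent_hp, user_hp):
    """Reward for an episode, from the absolute difference of the final HPs."""
    diff = abs(agent_hp - user_hp)
    return REWARDS[bisect_right(REWARD_EDGES, diff)]
```

For instance, a fight that ends with the agent at 200 HP and the user at 190 HP falls in the first row of Table 2, whereas a fight that ends 200 to 0 falls in the last one.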

For  comparison  purposes  we  developed  three  static  agents:  weak,  medium  and 
strong. The weak agent was very easy to defeat; 50% of its actions were to stay;
it only executed combos 0 and 10 of Table 1; the combo-breaker was always p.
The  medium  agent  was  obtained  by  training  our  adapting  agent  against  a  user 
for  20  rounds  in  advance,  while  it  did  not  adapt  during  fights.  The  strong  agent 
was  very  difficult  to  defeat;  it  always  got  close  to  the  user  and  executed  one  of 
all  available  combos  randomly  whenever  close  enough;  the  combo  breakers  were 
chosen  stochastically  based  on  the  distribution  of  combo-breakers  for  the  initial 
actions  executed  by  the  user. 

We  compared  these  static  agents  against  two  versions  of  our  adapting  agent: 
adapO  and  adapF.  The  adapO  agent  was  as  explained  in  Section  3.  The  adapF 
agent  was  structurally  the  same  as  adapO  but  it  had  been  trained  by  playing  20 
rounds  beforehand. 

We asked 28 real users to play the game between 15 and 30 rounds against
each agent. The users were of different nationalities and ages, and had different
levels of expertise at playing videogames. After the first 15 rounds with an agent, the
user  could  quit  whenever  he/she  was  no  longer  having  fun.  The  users  did  not 
know  the  characteristics  of  each  agent.  The  order  of  the  agents  was  randomized 
for  the  users.  The  users  filled  in  a  questionnaire  after  the  experiments.  They 
were  asked  to  order  the  agents  from  most  fun  to  least  fun. 

Fig. 3. (Left) Questionnaire results. (Right) Analysis of playing length


The  results  of  the  questionnaire  are  shown  in  Fig.  3  (Left).  This  band  chart 
indicates  how  many  subjects  rated  the  opponents  as  the  corresponding  rank. 
The  adapF  received  the  least  amount  of  negative  ratings. 

We also compare the length of play against each opponent. For each subject,
the opponents were ranked in descending order of the length. In case
of a draw, both opponents were given the same rank. The comparison is shown in
Fig. 3 (Right). Similar to the questionnaire, the proposed agents were the
opponents that figured least among the least played opponents.

5  Conclusion 

An agent that adapts to the level of the user in a fighting videogame was
developed. The adapting agent is divided into three subagents: MSA, ECSA and
RCSA, each of which is in charge of handling different aspects of fighting. In
comparison  with  static  agents,  the  adapting  agent  received  the  least  amount  of 
negative  ratings. 

Abstract. Acquiring knowledge directly from the domain expert requires a
knowledge representation and specification method that is comprehensible and
feasible for the holder and creator of that knowledge. The technique, known as
multiple classification ripple down rules (MCRDR), is novelly applied to the
problem of building and maintaining a library of training scenarios for use by
customs and immigration officer trainees in our agent-based virtual environment,
which may be indexed for retrieval based on the rules associated with
them. Our evaluation study aims to demonstrate the utility of the MCRDR
combined case and exception structure rule-based approach over standard rules
alone and a non-case-based approach.

Keywords:  Ripple  down  rules,  scenarios,  training  simulation 


1  Introduction 

The comprehensibility and usability of knowledge structures have received less
attention than their correctness, completeness and consistency [7]. In recognition that
acquiring knowledge has been a bottleneck in the development of knowledge based
systems (KBS) [5], further leading to validation and maintenance issues, it is
important that the knowledge representation and acquisition method be accessible and
manageable by a human. In cases where the knowledge is acquired via machine learning,
maintenance and acquisition by the human is less of an issue. However, the output of
these algorithms should be comprehensible to the human. Quinlan [6] refers to
Donald Michie's requirement that concept expressions must be "correct and effectively
computable descriptions that can be assimilated and used by a human being", going so
far as to regard knowledge representations which are not comprehensible to the
domain expert as not qualifying as knowledge.

We are currently developing an agent-based virtual training environment, known
as BOrder Security System (BOSS), for trainee airport customs and immigration
officers to determine whether passengers should be allowed entry into Australia.
Ripple Down Rules (RDR), in particular Multiple Classification RDR (MCRDR) [3],
have been used to address many different
problems within many application domains. However, novelly we have employed
MCRDR to represent and capture the knowledge needed in an agent-based virtual
environment training simulation. The agent's knowledge (i.e. what to say, what to do,
how to respond) and the domain knowledge to be passed to the trainee are both
captured using MCRDR. In this way we concurrently and interactively within the training
environment train the software agents and the human. The significance of using a
knowledge representation (KR)
which can be employed by the domain trainer/expert is that it becomes feasible to
deploy the system because we can move beyond a research prototype containing a
handful of handcrafted scenarios.

In a previous study involving 36 participants we found a statistically highly
significant difference (p = 1.63384E-11), using a one-tailed t-Test: Two-Sample Assuming
Equal Variances, in the scores achieved on pre and post knowledge tests for the
border security domain after using our system. Given that our training system was
found to be a useful way to train, we seek to address a significant impediment to the
widespread use of virtual environments as training systems: acquiring and
maintaining the 1) training scenarios, 2) domain knowledge and 3) agent/avatar behaviours. In
this paper we focus on the latter two issues.

The goal of this paper was to evaluate if users found MCRDR to be a more
comprehensible knowledge representation and acquisition technique than standard
production rules and whether providing a scenario context also assisted with knowledge
acquisition. In the next section we explain how MCRDR are used in our training
simulation application and provide the results of a study showing the efficacy of using
MCRDR. We conclude with future work and summary.

2  Acquiring  Knowledge  and  Experience 

While the training environment is being developed to assist trainees to acquire the
domain knowledge, in this paper we are focused on the comprehensibility of the
MCRDR knowledge representation and the usability of the KA process for the human
domain expert who will train the system. Two key features of MCRDR which we
sought to evaluate are the use of an exception structure and cases to motivate and
validate knowledge acquisition [3]. Looking at Fig. 1, rule 1 was added in response to

In our approach, the domain expert can interrupt a running scenario and display
its attributes (Fig. 2). At this point the system presents its conclusions for the scenario
based on the RDR KB, which initially contains only the default rule, not associated with
any scenario. The expert may then either agree with the conclusions or disagree by
adding a new rule into the RDR KB, which will then be associated with that scenario
or case. If they disagree, they will be re-shown the attributes of the current scenario,
as well as the attributes of the scenario associated with the rule giving the incorrect
conclusion.

Fig. 2. BOSS screen showing a case being popped up


3  Comprehensibility,  Usability  and  Usefulness  Study 

We used a 'Repeated Measures' design with two within subjects factors (scenario,
media/format) and one between subjects factor (stimuli order). In the present study
this means that all participants received Scenario 1: no luggage, Scenario 2: criminal
conviction, and No Scenario in the same experimental session. Furthermore, there
were two possible media/formats for presenting the scenarios: the virtual training
environment, which involved using RDR, or textual format, leading to four
combinations: S1RDR, S1TXT, S2RDR, S2TXT. Each participant encountered both scenarios
but received either S1RDR and S2TXT or S2RDR and S1TXT. In the virtual training
environment (VTE) exactly the same text was heard and read as in the text-only
treatment. It was our goal to test whether experiencing the scenario in a VTE and
using the RDR knowledge acquisition method in that environment was easier, more
natural and produced better rules/knowledge.

Each participant was also given the task of writing some production rules without
the use of any scenario to provide context. We called this treatment COLD. As we
were dealing with novices rather than experts, domain knowledge was provided to
participants for each task. Participants thus acted as their own control group. Because
of the increased statistical power of the 'Within Subjects' experiment design, fewer
participants were required to draw valid conclusions.

To avoid order effects we altered the order of receiving stimuli. To allocate
treatments to experimental units we used a Latin squares design, which controls the
variation and estimates the main effects of all factors, to produce the orders shown
in Fig. 3. A Latin square is an n x n table containing n different symbols in such a way
that each symbol occurs exactly once in each row and exactly once in each column; it is
used in experimental designs in which one wishes to compare treatments and to control for
two other known sources of variation. Three people were assigned to each combination
(i.e. 18 participants). Following each treatment participants were asked the
questions relevant to that treatment (see Fig. 4).
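
The Latin square property can be made concrete with a short Python sketch. It builds a cyclic Latin square over three treatment labels and checks the row and column condition; the cyclic construction and the function names are illustrative assumptions and do not reproduce the actual orders of Fig. 3.

```python
def latin_square(symbols):
    """Cyclic n x n Latin square: row r is the symbol list rotated by r."""
    n = len(symbols)
    return [[symbols[(r + c) % n] for c in range(n)] for r in range(n)]


def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


# Each row is one presentation order of the three treatments.
orders = latin_square(["RDR", "TXT", "COLD"])
```

Reading each row as the order in which one group of participants receives the treatments, every treatment appears once in every position, which is what balances the order effect.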


Scenario 1 - passengers with no luggage

You are an expert in airport security. Watch/read the first scenario. Then correct the system's conclusions
afterwards using the domain knowledge given below.

Knowledge for passengers with no luggage

1. A passenger with no luggage is immediately suspicious. Customs officers are advised to search the
passenger's clothes and body and consider the passenger to present a moderate risk.

2. If the passenger is only staying for one day, they present a low risk and customs officers should let the
passenger through.

Ripple Down Rules Questions

1. It was easy to understand what the scenario's attributes were.

2. I found it easy to understand what the system's conclusion was and how to disagree with it.

3. I found it easy to understand how the system worked out its conclusion.

4. I found it easy to select extra categories to change the conclusion for the scenario.

5. Once I had chosen the extra categories, I found it easy to specify what the new conclusion should be.

6. The user manual provided was important in helping me to understand how to use the system.

Text Questions

1. I understood the scenario attributes.

2. It was easy to write rules for the first scenario.

3. It was easy to write rules for the second scenario.

Cold Questions

1. It was easy to write the first rule.

2. It was easy to write more rules.

3. It was easy to write the example.

Comparison questions

1. Which task did you find the easiest?

2. Which task did you find the hardest?

3. Which task did you find the most enjoyable?


Fig.  4.  Sample  information  and  questions  in  our  study 


Participants were recruited across campus. The study took one hour. Participants
comprised 9 males and 9 females, aged 18-51 (average age 22); 11 had a first language
other than English (9 Chinese, 1 Indonesian, 1 Korean), 7 were born in Australia, 5
had lived less than 1 year in an English speaking country, and half had played computer
games for 5-12 years. Descriptive statistics for the RDR questions in Fig. 4 are
provided in Table 1. We found no significant difference between S1 and S2 for both the
RDR and text treatments. This means that we were able to combine the results of both
scenarios to double the number of responses. We see in Fig. 5 that RDR was found to
be the easiest and most enjoyable. Using Median and Mode as measures of central
tendency we see that participants understood how to interpret the conditions, rules
and conclusions. They also found entering new knowledge easy.

We note the subjectivity of these questions and conducted some analysis on the
correctness of the rules. Each rule was scored using the following criteria. For RDR
each rule was given a score out of 2 based on: 1) whether they were able to write a
rule; 2) whether the rule was detailed enough based on the knowledge given; 3) whether
they showed the features of the RDR in the rule (second rule); and whether the rule
was correct based on the knowledge given. Text rules were given a score out of 2 where
1 mark concerned whether the risk was correctly specified and 1 mark considered if the
agent action was correctly specified. Cold rules were given a score out of 2 based on
whether the rule fits with the knowledge provided; if it fits with other rules (using
RDR type logic); number of extra (irrelevant) rules. For each participant a score based
on the rank order of treatments was given, with 3 the highest rank. The results are
given in Table 2. From the scoring process, and supported in the results, the cold
treatment rules tended to lack structure, consistency and relevant content. Providing
the context of a scenario in the text treatment was obviously helpful and produced the
best rules in this study. The difference in scores for text and cold rules was
statistically significant (p=0.037). While text ranked highest, we expect that the
benefits of RDR for consistency and relevance to the case would be better demonstrated
in the longer term (even after a day rather than just 20 minutes of usage) and after
some training.

Fig. 5. Results of comparative questions


Table 1. Descriptive stats for Likert responses to RDR Qs 1-6

Key: 5=Strongly Agree, 4=Agree, 3=Neutral, 2=Disagree, 1=Strongly Disagree.

4RDR   ...  1.059  ...
5RDR   3.889  ...  4  ...  0.211  ...  2  5
6RDR   4.167  0.167  4  4  0.707  0.500  0.776  -0.250  2  3  5


Table 2. Comparison of correctness (top score/ave possible is 3)

Anova: Single Factor

Groups   Count   Sum   Ave      Variance
RDR      18      37    2.0556   0.6438
Text     18      44    2.4444   0.6144
Cold     18      34    1.8889   0.5752

Var Source     SS       df   MS       F        P-val   F crit
Between Grps   2.926    2    1.4630   2.3939   0.102   3.1788
Within Grps    31.167   51   0.6111
Total          34.093   53
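
The sums of squares and the F statistic of Table 2 can be recomputed from the group summaries alone. The Python sketch below uses the standard one-way ANOVA formulas; the function name is hypothetical, and the group means are taken as Sum/Count from the table. The p-value is not recomputed because it would require an F-distribution routine outside the standard library.

```python
def anova_from_summary(groups):
    """One-way ANOVA from (count, mean, sample variance) per group.

    Returns (SS between, SS within, df between, df within, F).
    """
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * v for n, _, v in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


# RDR, Text and Cold rows of Table 2: count, Sum/Count, variance.
table2 = [(18, 37 / 18, 0.6438), (18, 44 / 18, 0.6144), (18, 34 / 18, 0.5752)]
ss_b, ss_w, df_b, df_w, f_stat = anova_from_summary(table2)
```

The recomputed values agree with the table to the printed precision, and the F statistic lies below the listed critical value, consistent with the reported p-value of 0.102.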


We found a highly statistically significant difference for the RDR (p=0.005) and cold
(p=0.0049) treatments in the responses according to the order in which the treatment
occurred. In general we can say that the first treatment will do worse than the same
treatment when done second or third. This finding makes sense given that experience
with any task is likely to improve performance. Similarly, the time taken to perform
each task was significantly affected by the order in which the task occurred,
regardless of the treatment. Table 3 shows that RDR took much longer than the other
tasks, which was due to the need to read and refer to the user manual. Note that
spending more time engaged in a training task is in general beneficial for learning.
In performing a correlation between the questions across tasks we find a positive
correlation between the question about the hardest task and the questions about
understanding the attributes in the Text task (0.856) and in the RDR task (0.701).

Table 3. Comparison of time to conduct each task

Anova: Single Factor

Groups      Count   Sum       Average    Variance
RDR time    18      7.10611   0.394784   0.029359
Text time   18      2.45088   0.13616    0.006144
Cold time   18      2.46188   0.136771   0.004458

Var Source     SS        df   MS         F          P-value    F crit
Between grps   0.80075   2    0.400372   30.05634   2.38E-09   3.178
Within grps    0.67936   51   0.013321
Total          1.48010   53

We also find a medium to strong positive correlation (0.775, 0.778, 0.712) between the
question on which task was most enjoyable and Q2, 4 and 5 for the RDR task. On an
F-Test Two-Sample for Variances at the 95% confidence level, there was a statistically
significant difference between the responses of participants who had played computer
games for more than 5 years and those with less gaming experience, with those with
less gaming experience giving the higher scores.

When  determining  if  there  was  a  significant  difference  in  perceived  difficulty  of 
adding  the  first  rule  and  subsequent  consistent  rules  (a  claimed  strength  of  RDR) 
using  one-tail  t-Test:  Paired  Two  Sample  for  Means  we  found  a  significant  difference 
(p=0.036)  showing  that  participants  found  adding  additional  rules  more  difficult  than 
the  first  rule  for  the  cold  but  not  for  the  text  treatment. 

4  Further  Considerations 

Gaines  [1]  proposed  the  use  of  an  exception  directed  acyclic  graph  to  measure  the 
comprehensibility  of  production  rules,  decision  trees  and  rules  with  exceptions.  The 
approach  computes  complexity  based  on  the  number  of  the  nodes  (N),  final/end  nodes 
(F),  arcs/edges  (A),  Excess  (E=A+V-N),  clauses  (C),  where  complexity  X  = 
(N+2E+2C)/5.  Sugiura,  Riesenhuber  and  Koseki  [7]  also  offer  the  measures  of  table 
size,  similarity  of  concept  function,  continuity  in  attributes  with  ordinal  values  and 
conformity  between  concept  functions  and  real  cases  to  determine  comprehensibility 
of  tabular  knowledge  bases.  To  potentially  support  better  comparison  with  the  RDR 
rules,  we  could  apply  some  parsing  techniques  from  language  processing  (such  as  link 
grammar)  to  generate  more  structured  output  from  the  text  and  cold  rules.  In  contrast, 
RDR are structured, linked to cases and linked to scenarios, making evolution and
growth of the KB possible and supporting Sugiura et al.'s (1993) criteria of conformity
between concept functions and realistic cases.
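
As a small worked example of the complexity measure quoted from Gaines [1], the Python sketch below evaluates X = (N+2E+2C)/5 for a hypothetical rule graph. The symbol V in the printed excess formula is not defined in the text; reading it as the number of final nodes F is an assumption made here, and the function names and the example counts are likewise hypothetical.

```python
def excess(arcs, final_nodes, nodes):
    """Excess E, assuming the printed 'V' stands for the final nodes F."""
    return arcs + final_nodes - nodes


def complexity(nodes, excess_value, clauses):
    """Complexity X = (N + 2E + 2C) / 5 as quoted in the text."""
    return (nodes + 2 * excess_value + 2 * clauses) / 5


# Hypothetical graph: 5 nodes, 2 final nodes, 6 arcs, 8 clauses.
e = excess(arcs=6, final_nodes=2, nodes=5)
x = complexity(nodes=5, excess_value=e, clauses=8)
```

Keeping the excess as a separate argument lets the complexity function be used unchanged if the excess is defined differently in the original source.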

We considered asking our participants to represent their knowledge using a logic-based
formalism, decision trees or decision tables. However, we did not believe that an
untrained person would be able to write valid logic statements, decision tables or
trees. We note that RDR can be transformed into a propositional or FOL [4]
representation and also into unambiguous decision tables [1]. We felt that asking
participants to write statements in the IF-THEN format (i.e., production rules) was
something we could expect an average person to be able to achieve. Furthermore, in
our study we tested the role that context in the form of cases plays in assisting the
knowledge acquisition task and thus we provided cases for the text-based scenarios
(which were the same as the cases used in the RDR condition) and compared the
difficulty of writing rules when this context is not known or specified.

Abstract. Large-scale comparable corpora have become more abundant and accessible
than parallel corpora with the explosive growth of the World Wide Web.
Therefore, strategies for bilingual terminology extraction from comparable texts
must be given more attention in order to enrich existing bilingual lexicons and
thesauri and to enhance Cross-Language Information Retrieval. In the present
paper, we focus on the enhancement of Cross-Language Information Retrieval
using a two-stage corpus-based translation model that includes bi-directional
extraction of bilingual terminology from comparable corpora and selection of the
best translation alternatives on the basis of their morphological knowledge. The
impact of comparable corpora on the performance of the Cross-Language
Information Retrieval process is evaluated in this study, and the results indicate
that the effect is clearly positive, especially when using the linear combination
with bilingual dictionaries and the Japanese-English language pair.

Keywords: Cross-language information retrieval, comparable corpora, similarity,
co-occurrence tendency.


1  Introduction 

This paper addresses the problem of the lexical coverage of existing bilingual
dictionaries as well as the improvement of the performance of CLIR. The
main contributions concern the enhancement of CLIR by an automatic acquisition of
bilingual terminology from comparable corpora that will help cope with the limitations
of CLIR, especially in the query disambiguation process as well as during query
expansion with related terms. Furthermore, this study could be valuable for the
extraction of unknown words and their translations and thus the enrichment and
enhancement of bilingual dictionaries. Therefore, we present in this paper an approach
to learning bilingual terminology from textual resources other than bilingual
dictionaries, such as comparable corpora, and evaluations on CLIR. First, we propose a
two-stage corpus-based translation model for the acquisition of bilingual terminology
from comparable corpora. The first stage concerns the extraction of bilingual
translations from the source language to the target language, and also from the target
language to the source language. The two results are combined for the purpose of
disambiguation. In the second stage, the extracted translation alternatives are filtered
on the basis of their morphological knowledge. A linguistics-based pruning technique is
applied in order to compare source words and their target language translation
equivalents on the basis of their part-of-speech tags. Furthermore, we present a
combined translation model involving the comparable corpora and readily available
bilingual dictionaries. In our evaluations, we used a large-scale Japanese-English test
collection and different weighting schemes of the SMART retrieval system and confirmed
the effectiveness of the proposed translation model in CLIR.

The remainder of the present paper is organized as follows: Section 2 presents an
overview of the proposed model. Section 3 presents the two-stage corpus-based
translation model. Section 4 introduces a combination of different translation models.
Experiments and evaluations in CLIR are reported in Section 5. Section 6 concludes the
present paper.


2  An  Overview  of  the  Proposed  Model 

Throughout this paper we seek to exploit collections of news articles for the
acquisition of bilingual terminology, in order to enrich existing multilingual lexical
resources and help cross the language barrier for information retrieval. We rely on
such comparable corpora for the extraction of bilingual terminology, in the form of
translations and/or expansion terms, i.e., words that will help the query expansion in
CLIR. The task of bilingual terminology extraction is accomplished by a two-stage
corpus-based translation model, which is described in detail in Section 3. A linear
combination involving the comparable corpora and bilingual dictionaries is then applied
in order to select the best translation candidates of the source terms of a given
query. Finally, documents are retrieved in the target language.


3  Two-Stage  Corpus-Based  Translation  Model 

A two-stage corpus-based translation model (Sadat et al., 2003a; Sadat et al., 2003b;
Sadat et al., 2003c), which is based on the symmetrical criterion in addition to the
assumption of similar collocation, aims to find translations of the source word in the
target language corpus but also translations of the target words in the source language
corpus. Linguistic resources were used in the two-stage corpus-based translation
model, as follows: (i) a collection of news articles from Mainichi Newspapers
(1998-1999) for Japanese and Mainichi Daily News (1998-1999) for English were
considered as comparable corpora, because they cover the same time period.
Documents of the NTCIR-2 test collection were also considered as comparable corpora in
order to cope with special features of the test collection during evaluations; (ii)
morphological analyzers, ChaSen version 2.2.9 (Matsumoto et al., 1997) for texts in
Japanese and OAK (Sekine, 2001) for English texts, were used in linguistic processing;
(iii) EDR (1996) and EDICT¹ bilingual Japanese-English and English-Japanese
dictionaries were considered in the translation of context vectors of source and target
languages. Japanese words written in Katakana representing foreign words and proper
names that were not found in the bilingual dictionaries were manually translated. A
transliteration process could be used in order to convert those words to their English
equivalents.

¹ http://www.csse.monash.edu.au/-jwh/wwwjdic.html

3.1  First  Stage  in  the  Proposed  Translation  Model 

The two-stage corpus-based translation model for the acquisition of bilingual
terminology is described as follows:

1. A simple bilingual terminology acquisition from source language to target
language to yield a first simple translation model represented by similarity
vectors $SIM_{S \rightarrow T}$.

2. A simple bilingual terminology acquisition from target language to source
language to yield a second simple translation model represented by similarity
vectors $SIM_{T \rightarrow S}$.

3. Merge the first and second models to yield a two-stage translation model,
based on bi-directional comparable corpora and represented by similarity
vectors $SIM_{S \leftrightarrow T}$.

The simple approach for bilingual terminology acquisition from comparable corpora
is based on the assumption of similar collocation, i.e., if two words are mutual
translations, then their most frequent collocates are likely to be mutual translations
as well. We follow the strategies of previous research (Dejean et al., 2002; Fung,
2000; Rapp, 1999; Sadat et al., 2003a; Sadat et al., 2003b; Sadat et al., 2003c).
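
As a minimal illustration of the similar-collocation assumption (a sketch, not the implementation used in the cited work), context vectors can be built from co-occurrence counts, translated through a seed dictionary and compared in the target language. The window size, the cosine measure, the function names and the toy romanised data below are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within +/- `window` tokens."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def translate_vector(vector, seed_dict):
    """Map a source-language context vector into the target vocabulary."""
    translated = Counter()
    for word, count in vector.items():
        for target in seed_dict.get(word, []):
            translated[target] += count
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity_vector(source_word, src_vectors, tgt_vectors, seed_dict):
    """Score every target word against one source word."""
    translated = translate_vector(src_vectors[source_word], seed_dict)
    return {t: cosine(translated, vec) for t, vec in tgt_vectors.items()}


# Hypothetical toy corpora; "keizai" is deliberately missing from the seed dictionary.
src_sentences = [["keizai", "seisaku", "happyo"], ["keizai", "seicho", "yosoku"]]
tgt_sentences = [["economy", "policy", "announcement"],
                 ["economy", "growth", "forecast"],
                 ["weather", "forecast", "rain"]]
seed_dict = {"seisaku": ["policy"], "happyo": ["announcement"],
             "seicho": ["growth"], "yosoku": ["forecast"]}

src_vectors = context_vectors(src_sentences)
tgt_vectors = context_vectors(tgt_sentences)
sim = similarity_vector("keizai", src_vectors, tgt_vectors, seed_dict)
print(max(sim, key=sim.get))
```

In this toy setting the unknown source word is matched to the target word whose collocates are the translations of its own collocates; real corpora would additionally need frequency weighting and stop-word handling.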

In further sections, we refer to the simple approach for bilingual terminology
acquisition from comparable corpora as simple corpus-based translation and to the
translation model representing the first stage of the two-stage corpus-based
translation as bi-directional corpus-based translation.
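
The merging step is not specified in detail in this section. One plausible reading of the symmetrical criterion, shown here purely as an assumption, keeps a pair only when each word appears among the other's top candidates and scores it by the product of the two directional similarities. The function name, the `top_n` cut-off and the product rule are hypothetical.

```python
def merge_bidirectional(sim_st, sim_ts, top_n=10):
    """Keep (s, t) only if t is a top candidate of s and s is a top candidate of t.

    sim_st: dict source -> {target: similarity}
    sim_ts: dict target -> {source: similarity}
    """
    def top(cands):
        return set(sorted(cands, key=cands.get, reverse=True)[:top_n])

    merged = {}
    for s, cands in sim_st.items():
        for t in top(cands):
            back = sim_ts.get(t, {})
            if s in top(back):
                # Combining the two scores by a product is an assumption.
                merged[(s, t)] = cands[t] * back[s]
    return merged


# Hypothetical directional similarity vectors.
sim_st = {"keizai": {"economy": 0.9, "weather": 0.4}}
sim_ts = {"economy": {"keizai": 0.8}, "weather": {"tenki": 0.7}}
merged = merge_bidirectional(sim_st, sim_ts)
print(merged)
```

Pairs supported in only one direction ("keizai", "weather") are dropped, which is the disambiguation effect the bi-directional model is meant to provide.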

3.2  Second  Stage  in  the  Proposed  Translation  Model 

Combining linguistic and statistical methods is becoming increasingly common in
computational linguistics, especially as more corpora become available (Klavans &
Tzoukermann, 1996; Sadat et al., 2003c). We propose to integrate linguistic concepts
into the corpus-based translation model. Morphological knowledge such as
Part-of-Speech (POS) tags, context of terms, etc., could be valuable to filter and
prune the extracted translation candidates. The objective of the linguistics-based
pruning technique is the detection of terms and their translations that are
morphologically close enough, i.e., that have close or similar POS tags. The proposed
approach selects a fixed number of equivalents from the set of extracted target
translation alternatives that match the Part-of-Speech of the source term. Japanese
foreign words were not pruned with the proposed linguistics-based technique but could
be treated via transliteration, i.e., conversion of Japanese katakana to their English
equivalents or to the alphabetical description of their pronunciation (Knight & Graehl,
1998). Finally, the generated translation alternatives are sorted in decreasing order
by similarity values. Rank counts are assigned in increasing order, starting at 1 for
the first sorted list item. A fixed number of top-ranked translation alternatives are
selected and misleading candidates are discarded.
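
The pruning and ranking just described can be sketched as follows. This is an illustrative sketch only: it assumes that source and target POS tags have already been mapped to a shared tag set and treats "close" tags as exactly equal; the function name and the sample candidates are hypothetical.

```python
def prune_and_rank(source_pos, candidates, p=5):
    """Filter candidates by POS, sort by similarity, keep the top p with ranks.

    candidates: list of (target_term, target_pos, similarity) tuples.
    Returns a list of (rank, target_term, similarity), rank starting at 1.
    """
    # Linguistics-based pruning: keep only candidates whose POS matches.
    matching = [c for c in candidates if c[1] == source_pos]
    # Sort in decreasing order of similarity.
    matching.sort(key=lambda c: c[2], reverse=True)
    # Assign ranks in increasing order and keep a fixed number of alternatives.
    return [(rank, term, sim)
            for rank, (term, _pos, sim) in enumerate(matching[:p], start=1)]


candidates = [("economy", "NOUN", 0.91), ("grow", "VERB", 0.85),
              ("finance", "NOUN", 0.60), ("market", "NOUN", 0.75)]
ranked = prune_and_rank("NOUN", candidates, p=2)
print(ranked)
```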

4  Combining  Different  Translation  Models

Combining different translation models has shown success in previous research
(Dejean et al., 2002). We propose a combined translation model involving comparable
corpora and readily available bilingual dictionaries. The proposed dictionary-based
translation model is derived directly from readily available bilingual dictionaries, by
considering all translation candidates of each source entry as equiprobable, to yield a
probabilistic translation model $P_2(t|s)$. The linear combination involves the two
probabilistic translation models $P_1(t|s)$ and $P_2(t|s)$, derived respectively from
the comparable corpora (either the simple or the two-stage model) and from readily
available bilingual dictionaries.
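
A linear combination of a corpus-derived model and an equiprobable dictionary-derived model can be sketched as below. The mixing weight `lam`, the normalisation of similarity scores into probabilities and all names are assumptions for illustration; the weights actually used are not stated in this section.

```python
def dictionary_model(entries):
    """Equiprobable P(t|s) over all dictionary translations of a source entry."""
    return {t: 1.0 / len(entries) for t in entries} if entries else {}


def corpus_model(similarities):
    """Turn similarity scores of a source term into a distribution (assumption)."""
    total = sum(similarities.values())
    return {t: v / total for t, v in similarities.items()} if total else {}


def combine(p_corpus, p_dict, lam=0.5):
    """P(t|s) = lam * P_corpus(t|s) + (1 - lam) * P_dict(t|s)."""
    terms = set(p_corpus) | set(p_dict)
    return {t: lam * p_corpus.get(t, 0.0) + (1.0 - lam) * p_dict.get(t, 0.0)
            for t in terms}


p_dict = dictionary_model(["economy", "economics"])
p_corpus = corpus_model({"economy": 3.0, "market": 1.0})
p_combined = combine(p_corpus, p_dict, lam=0.5)
print(sorted(p_combined.items(), key=lambda kv: kv[1], reverse=True))
```

A candidate supported by both resources ("economy") ends up ahead of candidates supported by only one of them, which is the intended effect of the combination.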


5  Evaluation  and  Experiments 

We considered the set of news articles as well as the abstracts of the NTCIR-2 test
collection as comparable corpora for the Japanese-English language pair. Content words
(nouns, verbs, adjectives, adverbs and foreign words) were extracted from the English
and Japanese corpora. Context vectors were constructed for 13,552,481 Japanese
terms and 1,517,281 English terms. Similarity vectors were constructed for
96,895,255 (Japanese, English) pairs of terms and 92,765,129 (English, Japanese)
pairs of terms. Bi-directional similarity vectors (after merging and disambiguation)
resulted in 58,254,841 (Japanese, English) pairs of terms. The SMART information
retrieval system (Salton, 1971), which is based on the vector model, was used to
retrieve English documents. We used the monolingual English runs, i.e., English queries
to retrieve English documents, and the bilingual Japanese-English runs, i.e., Japanese
queries to retrieve English documents. Bilingual translations were extracted from the
collection of news articles using the simple translation model and the two-stage
translation model. A fixed number p (set to five) of top-ranked translation
alternatives was retained for evaluations in CLIR. Results on the monolingual run as
well as on the bilingual runs using the two-stage corpus-based translation model and
the linear combination with bilingual dictionaries are shown in Table 1. Evaluations
are based on the average precision, the difference in terms of average precision
relative to the monolingual counterpart and the improvement over the monolingual
counterpart.

Retrieval methods are represented by the monolingual retrieval Mono, the
dictionary-based translation DT, the simple corpus-based translation model SCT, the
bi-directional corpus-based translation model BCT and the two-stage corpus-based
translation model TCT. Linear combinations are represented by SCT+DT, BCT+DT and
TCT+DT.

Table 1. Evaluations of the proposed and combined translation models

Method     Average Precision (p=5)   % Monolingual
Mono       0.3368                    100%
DT         0.2279                    67.66%
SCT        0.1417                    42.07%
BCT        0.1801                    53.47%
TCT        0.2008                    59.62%
SCT+DT     0.2366                    70.25%
BCT+DT     0.2721                    80.79%
TCT+DT     0.2987                    88.69%

As illustrated in Table 1, combining different translation models yields a
significantly better result than using each model by itself. Translation models based
on comparable corpora and bilingual dictionaries complemented each other, and their
linear combination provided a valuable resource for query translation/expansion
in CLIR and improved the effectiveness of information retrieval.

6  Conclusion 

In the present paper, we investigated the approach of extracting bilingual terminology
from comparable corpora in order to enhance CLIR, especially in the disambiguation
and query expansion processes, and possibly enrich existing bilingual lexicons. We
proposed a two-stage corpus-based translation model consisting of bi-directional
extraction of bilingual terminology and linguistics-based pruning. Among the
drawbacks of the proposed translation process is the introduction of many noisy terms
or wrongly translated terms; however, most of those terms could be considered useful
for query expansion in CLIR, though not for translation.

The combination of the two-stage corpus-based translation model and bilingual
dictionaries yields better translations and improves the effectiveness of information
retrieval across the Japanese and English languages.

Abstract. The customer churn problem hugely affects telecommunication
services in particular, and businesses in general. Note that in the
majority of cases the number of potential churners is much smaller
than the number of non-churners. Therefore, the imbalanced distribution of
samples between churners and non-churners is a concern when building a churn
prediction model. This paper presents a Local PCA approach to solve
the imbalanced classification problem by generating new churn samples. The
experiments were carried out on a large real-world telecommunication
dataset and assessed on a churn prediction task. The experiments showed
that Local PCA along with Smote outperformed Linear Regression
and Standard PCA data generation techniques.

Keywords: PCA, Imbalanced Classification, Churn Prediction.


1  Introduction 

Customer churn has become a serious problem for companies, mainly in the
telecommunication industry. This is a result of recent changes in the
telecommunications industry, such as new services and the liberalisation of the
market. In recent years, data mining techniques have emerged as one of the methods to
tackle the customer churn problem [1,8].

The study of customer churn can be seen as a classification problem (Churn
and Non-Churn classes). The main goal is to build a robust classifier to predict
potential churn customers. However, an imbalanced distribution of class samples is
an issue in data mining as it leads to poor classification results [4]. In this paper,
we focus on overcoming this problem by increasing the number of churn samples
with an over-sampling approach. The aim is to correct the distribution of samples
by adding minority class samples so that an optimal classifier can be built. Various
sampling approaches have been proposed to counter non-heuristic sampling
problems.

Synthetic Minority Over-sampling Technique (Smote) [3] generates artificial
data along the line between minor class samples and their K minority class nearest
neighbours. This causes the decision boundaries for the minor class space to
spread further into majority class space. An extension of Smote is the
Smote + Edited Nearest Neighbour (ENN) [9] approach, which removes more
unnecessary samples and provides more in-depth data cleaning.
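
The interpolation idea behind Smote can be sketched in a few lines of NumPy. This is a simplified illustration rather than the reference implementation: it assumes at least two minority samples, uses brute-force Euclidean neighbours, and the function name and parameters are chosen for this example only.

```python
import numpy as np


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples on segments between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    # Brute-force pairwise distances; a sample is not its own neighbour.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                 # random minority sample
        nn = neighbours[base, rng.integers(k)]  # one of its k nearest neighbours
        gap = rng.random()                     # position along the segment
        synthetic[i] = minority[base] + gap * (minority[nn] - minority[base])
    return synthetic


minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote(minority, n_new=10, k=2)
print(new_samples.shape)
```

Every synthetic point is a convex combination of two existing minority samples, which is why the minority region expands only along lines between neighbours.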

The  main  idea  of  our  approach  is  to  form  a  new  minority  class  space  by 
generating  minority  class  data  using  the  K-means  algorithm  with  PCA[7].  PCA 
reveals  the  internal  structure  of  a  dataset  by  extracting  uncorrelated  variables 
known  as  Principal  Components  (PC).  In  this  paper,  we  adopt  Local  PCA  data 
regression  to  generate  new  dataset  and  add  raw  data  to  change  the  distribution 
of  class  samples. 

This paper is organised as follows: the next section outlines the proposed
approach to the churn prediction task. Section 3 explains the experiments and the
evaluation criteria. We conclude and highlight some key remarks in Section 4.

2  Approaches 

Our proposed approach combines the PCA technique, the Genetic Algorithm (GA)
and the K-means algorithm to generate new data for the minority class. First and
foremost, the minority class dataset $d_{churn}$ is formed from the original raw
dataset $d_{raw}$. GA K-means clustering is applied on $d_{churn}$ to form K clusters.
The next step is to apply PCA regression on each cluster to transform it
back to the original feature space in terms of the selected principal components. We
believe that applying regression locally avoids the inclusion of redundant
information in the principal components because of the lower variance within the
clusters. The transformed data are then added to $d_{raw}$ to improve the
distribution of minority class samples. Finally, $d_{raw}$ is used to build a churn
prediction model for classification. Figure 1 shows the main steps of the
approach.

Fig. 1. The description of the proposed approach

2.1  GA  K-Means  Clustering  Algorithm

The  standard  K-means  algorithm  is  sensitive  to  the  initial  centroid  and  poor 
initial  cluster  centres  would  lead  to  poor  cluster  formation.  We  employ  Genetic 
Algorithm  (GA)[5]  to  avoid  sensitivity  problem  in  centroid  selection. 

In the GA K-means algorithm, a gene represents a cluster centre and a chromosome
of K genes represents a set of K cluster centres. The GA K-means algorithm
steps are:

1) Initialization: Randomly select K points as cluster centres (chromosomes)
from the original data set and apply k-means.
2) Selection: The chromosomes are selected according to a specific selection method.
3) Crossover: Selected chromosomes are randomly paired with other parents for
reproduction.
4) Mutation: Apply the mutation operation to ensure diversity in the population.
5) Elitism: Store the chromosome that has the best fitness value in each generation.
6) Iteration: Go to step 2, until the variation of the fitness value within the best
chromosomes is less than a specific threshold.
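
The six steps can be sketched as follows. The concrete choices here are assumptions, since the text leaves them open: fitness is the within-cluster sum of squared errors, selection is a binary tournament, crossover is a single-point exchange of centres and mutation is small Gaussian noise. All names and default values are hypothetical.

```python
import numpy as np


def assign(data, centres):
    """Index of the nearest centre for every sample."""
    d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    return d.argmin(axis=1)


def kmeans_step(data, centres):
    """One K-means update; an empty cluster keeps its previous centre."""
    labels = assign(data, centres)
    new = centres.copy()
    for c in range(len(centres)):
        members = data[labels == c]
        if len(members):
            new[c] = members.mean(axis=0)
    return new


def sse(data, centres):
    """Within-cluster sum of squared errors (lower is fitter)."""
    labels = assign(data, centres)
    return float(((data - centres[labels]) ** 2).sum())


def ga_kmeans(data, k, pop_size=10, max_gen=50, mut_scale=0.1, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # 1) Initialisation: each chromosome holds K centres drawn from the data.
    pop = [kmeans_step(data, data[rng.choice(len(data), k, replace=False)])
           for _ in range(pop_size)]
    best = min(pop, key=lambda ch: sse(data, ch))
    best_cost = sse(data, best)
    for _ in range(max_gen):
        costs = np.array([sse(data, ch) for ch in pop])

        def pick():
            # 2) Selection: binary tournament (an assumed selection method).
            a, b = rng.integers(pop_size, size=2)
            return pop[a] if costs[a] <= costs[b] else pop[b]

        children = []
        while len(children) < pop_size - 1:
            p1, p2 = pick(), pick()
            # 3) Crossover: single-point exchange of cluster centres (genes).
            cut = rng.integers(1, k) if k > 1 else 0
            child = np.vstack([p1[:cut], p2[cut:]])
            # 4) Mutation: small Gaussian perturbation of every centre.
            child = child + rng.normal(0.0, mut_scale, child.shape)
            children.append(kmeans_step(data, child))
        # 5) Elitism: the best chromosome so far survives unchanged.
        pop = children + [best]
        gen_best = min(pop, key=lambda ch: sse(data, ch))
        gen_cost = sse(data, gen_best)
        improvement = best_cost - gen_cost
        best, best_cost = gen_best, gen_cost
        # 6) Iteration: stop once the best fitness changes by less than tol.
        if improvement < tol:
            break
    return best, best_cost
```

Because the elite chromosome is always carried over, the best cost never increases from one generation to the next; a production version would normally add a patience counter before stopping.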


2.2  Linear  PCA  and  Data  Generation


We  apply  the  PCA  regression  technique  on  each  cluster  to  generate  a  new  dataset 
in  original  feature  space  in  terms  of  selected  principal  components  (PC).  PCA 
has  a  property  of  searching  for  PC  that  accounts  for  large  part  of  total  variance 
in  the  data  and  projecting  data  linearly  onto  new  orthogonal  bases  using  PC. 

Consider a dataset $X = \{x_i, i = 1, 2, \ldots, N, x_i \in \mathbb{R}^d\}$ with
attribute size d and N samples. The data is standardised so that the standard
deviation and the mean of each column are 1 and 0, respectively. The PCs can be
extracted by solving the following eigenvalue decomposition problem [7]:

$$\lambda \alpha = C \alpha, \quad \text{subject to } \|\alpha\|^2 = \frac{1}{\lambda} \qquad (1)$$

where $\alpha$ denotes the eigenvectors, $\lambda$ the eigenvalues and $C$ is the
covariance matrix. After solving equation (1), sort the eigenvalues in descending
order, as a larger eigenvalue gives a more significant PC. Assume that the matrix
$\alpha$ contains only a selected number of eigenvectors (PCs). The transformed data is
computed by

$$X_{tr} = \alpha^T X^T \qquad (2)$$

From equation (2), the matrix $X^T$ can be obtained by
$X^T = (\alpha^T)^{-1} X_{tr}$. Finally, the matrix $X^T$ is transposed again to get
the matrix $X_{new}$. Since we standardised the data in the first step, the original
standard deviation and mean of each column must be restored in each $X_{new}$. The
newly generated data $X_{new}$ is then added to $d_{raw}$ to adjust the distribution of
the samples. We continue this process until all clusters are transformed.
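
The per-cluster generation step can be sketched with NumPy as below. This is a minimal sketch under stated assumptions: it uses orthonormal eigenvectors (so the inverse in the reconstruction reduces to a multiplication by the eigenvector matrix) rather than the $1/\lambda$ normalisation of equation (1), it interprets the PC threshold as a cumulative explained-variance ratio, and it expects a cluster with at least two samples and two features. The function name is hypothetical.

```python
import numpy as np


def pca_regenerate(cluster, var_threshold=0.9):
    """Project a cluster onto its leading PCs and map it back to feature space."""
    X = np.asarray(cluster, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant columns
    Z = (X - mean) / std                     # standardise: mean 0, std 1
    C = np.cov(Z, rowvar=False)              # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the smallest number of PCs whose cumulative variance reaches the threshold.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    n_pc = min(int(np.searchsorted(ratio, var_threshold)) + 1, len(eigvals))
    alpha = eigvecs[:, :n_pc]                # d x n_pc matrix of selected PCs
    X_tr = alpha.T @ Z.T                     # transformed data, as in equation (2)
    Z_new = (alpha @ X_tr).T                 # back to the standardised feature space
    return Z_new * std + mean                # restore original mean and deviation


rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 3))
generated = pca_regenerate(cluster, var_threshold=0.8)
print(generated.shape)
```

With a threshold of 1.0 all components are kept and the cluster is reproduced exactly; lower thresholds discard low-variance directions, so the generated samples are similar to, but not copies of, the originals.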

We run two data generation approaches on PCA regression. The first approach
utilises all clusters in PCA regression (Local1). The second approach uses only
the centroid of each cluster to form a dataset of centre points (Local2) and uses
this data to extract principal components.

We randomly selected 139,000 customers from a real-world database provided by
Eircom. The distribution of churn and non-churn is imbalanced, as the training
and testing data contain 6000 (resp. 2000) churners and 91000 (resp. 37000)
non-churners, respectively. These datasets are described by 122 features, which
are explained in [6].

We implement the Decision Tree C4.5 (DT), the SVM, Logistic Regression
(LR) and Naive Bayes (NB) to build prediction models. We evaluated these
models using the following criteria: 1) the true churn rate (TP) is the ratio of
churn that was classified correctly and 2) the false churn rate (FP) is the ratio of
non-churn that was incorrectly classified as churn. A solution is considered
dominant when TP is high and FP is low. We use the Receiver Operating Characteristic
(ROC) technique to evaluate the various learning algorithms; it shows how
TP varies with FP. In addition, the area under the ROC curve (AUC) [2]
provides a single-number summary of the performance of learning algorithms. We
calculate the AUC with a threshold on FP of 0.5, as telecom companies are generally
not interested in FP above 50%.
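
The evaluation criteria can be made concrete with a short sketch. It assumes binary labels (1 for churn), real-valued classifier scores without ties, and reads the FP threshold as a partial AUC integrated over false churn rates from 0 to 0.5; the function names and the toy labels are illustrative only.

```python
import numpy as np


def tp_fp_rates(y_true, y_pred):
    """True churn rate and false churn rate for hard predictions."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tp = np.sum(y_true & y_pred) / np.sum(y_true)
    fp = np.sum(~y_true & y_pred) / np.sum(~y_true)
    return float(tp), float(fp)


def roc_points(y_true, scores):
    """ROC curve obtained by lowering the decision threshold one sample at a time."""
    y_true = np.asarray(y_true, bool)
    order = np.argsort(-np.asarray(scores, float), kind="stable")
    hits = y_true[order]
    tpr = np.concatenate([[0.0], np.cumsum(hits) / hits.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~hits) / (~hits).sum()])
    return fpr, tpr


def partial_auc(fpr, tpr, max_fpr=0.5):
    """Area under the ROC curve restricted to FP rates in [0, max_fpr]."""
    stop = np.searchsorted(fpr, max_fpr, side="right")
    x = np.concatenate([fpr[:stop], [max_fpr]])
    y = np.concatenate([tpr[:stop], [np.interp(max_fpr, fpr, tpr)]])
    # Trapezoidal rule written out explicitly.
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(y_true, scores)
print(tp_fp_rates(y_true, np.array(scores) >= 0.65))
print(partial_auc(fpr, tpr, max_fpr=0.5), partial_auc(fpr, tpr, max_fpr=1.0))
```

The partial AUC is not rescaled here, so its maximum value is `max_fpr` rather than 1.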

3.2  Experimental  Setup 

The main objective of the experiments is to observe whether additional churner samples
generated by PCA regression improve churn prediction results. The first experiment
examines the optimal cluster size of the GA K-means algorithm for Local PCA
regression by setting K to be in [4:72]. The second experiment compares the
prediction results of each classifier with PCA regression from experiment 1 to Linear
Regression (LiR), Standard PCA based data generation and Smote. The final
experiment examines the main objective. The number of churners is increased
from the original size of 6000 up to 30000 by setting the PC threshold to 0.9,
0.8, ..., 0.6. A new dataset is generated based on the Local1 and Local2 generation
methods for all experiments.


3.3  Results  and  Discussion 

The range of cluster sizes K in [36:72] produced better AUC results than smaller
K. In addition, GA K-means generally performed better than standard K-means.

The FP and TP rates of the 3 data regression methods and Smote are compared
in Figure 2. For local PCA, we selected the 2 best cluster sizes from the range [36:72]
for each classifier. The standard PCA operates similarly to local PCA but the
clustering technique is not applied to the churn data. For all classifiers, both types
of PCA regression performed as well as Smote and better than the other methods,
except for C4.5, for which it is hard to conclude which method is better.

The overall results of the third experiment are illustrated in Figure 3. The figure
presents the graphs of AUC against the churn size for each classifier using Local1
data generation, as it gave the best prediction results in experiment 2. From the
churn size of 6000 onward, additional churn samples generated by PCA were
added. The SVM, NB and LR performed well with sizes 6000 to 12000, but they
did not produce acceptable TP or FP rates afterwards, as can be seen
from Figure 3.

Fig. 2. ROC graph: Comparison of Local PCA, Standard PCA and Linear Regression.
Panels: (a) Logistic Regression, (b) Decision Tree; horizontal axis: False Churn (%).

Fig. 3. AUC Plot

In summary, the experiments showed that 1) the cluster size K did
produce different AUC results according to the size, 2) the local PCA data
regression performed better than Standard PCA and LiR and finally 3) adding similar
churn samples to the original data improved the TP rate for most of the classifiers.
However, FP reached over 50% after 12000. One of the reasons for the high FP
rate is the change in decision boundaries: more non-churn samples inside the
enlarged churn space lead to a high number of incorrectly classified non-churn samples.


4  Conclusion  and  Future  Works 

In this paper, we have applied the PCA regression method locally in combination
with the GA K-means algorithm to generate churn class samples in order to
address the imbalanced classification problem.

The approach was tested on telecommunication data on a churn prediction
task. The results showed that Local PCA along with Smote performs better
than Standard PCA and LiR in general. Additional samples improve the TP
rate for churn sizes [6000:12000] but the FP rate increases to over 50%. Since
we are more interested in identifying potential churners, as losing a client has a
significant effect on the telecom company, the improvement in TP is a good result.
Nevertheless, the FP rate must be limited, as a high FP rate can be expensive for
future marketing campaigns. We are interested in understanding why additional
churn samples give a high FP rate. There is a possibility that the churn data
generated with various PC thresholds lead to poor classification in terms of FP.

Abstract. Recent interest in multiagent dynamic decision modeling in
partially observable multiagent environments has led to the development
of several representation and inference methods. However, these
methods have limited application under time-critical conditions where
a trade-off between model quality and computational tractability is
essential. We present a formal representation for modeling time-critical
multiagent dynamic decision problems through interactive dynamic
influence diagrams. The proposed model, called interactive time-critical
dynamic influence diagrams, has the ability to represent space-temporal
abstraction in multiagent dynamic decision models. More importantly,
we adopt the notion of object-oriented design, which facilitates
self-expansion and self-compression in the model implementation.

Keywords: Time-Critical Decision Making, Multiagent Systems, Model
Construction.


1  Introduction 

There is growing interest in addressing single agent time-critical
dynamic decision problems [3,7]. Time-critical decision modeling is more significant
for multiagent applications due to the complex decision process and solutions.
Our interest in time-critical multiagent systems is motivated by the emergence
of several applications including the anti-air defense domain [5], RoboCup [4] and
multi-player online games [8]. Additionally, a suitable set of time-critical
decision making techniques would allow multiple agents to coordinate their actions
within a time limit so that individual rational actions do not adversely affect
the overall system efficiency [1].

The purpose of this paper is to present a formal technique, called interactive
time-critical dynamic influence diagrams (I-TCDIDs), for modeling multiagent
time-critical dynamic decision problems. We build on the representation of
interactive dynamic influence diagrams (I-DIDs) [2], and further formalize I-DIDs by
providing a time index for each node in the model, in the same vein
as time-critical dynamic influence diagrams [7]. The modeling of time is often
reasonable, but what we would really like is a flexible modeling language to
simplify models of problems with several repetitive structures, especially for the
case where models need to be expanded over time. We therefore adopt the notion
of object-orientation to design an efficient representation scheme for I-TCDIDs.
The proposed design reduces the implementation complexity of the problem and
makes possible the model's self-expansion and self-compression.

2  Background 

I-DIDs provide a relatively efficient method for representing multiagent
sequential decision problems [2]. Their static model, called the interactive
influence diagram (I-ID), extends influence diagrams by introducing a new
model node. We show one example of an I-ID in Fig. 1(a). The I-ID model is
constructed from the viewpoint of agent i, who interacts with agent j. The
model node, M_{j,l-1}, contains the possible computable models of the other
agent, such as m^1_{j,l-1} and m^2_{j,l-1}, at the lower level l-1.
Solutions of all models are weighted by agent i's beliefs on j's models and
aggregated into the chance node A_j (via the policy link). The issue becomes
complicated when the I-ID is expanded into an I-DID over time. As agent j
may act and receive observations, its models need to be updated to reflect
the new beliefs. We assume the model node at time t contains two of j's
models (m^{t,1}_{j,l-1} and m^{t,2}_{j,l-1}), and show the model update in
Fig. 1(b). Since agent j may receive any of |O_j| (= 2) possible
observations, the updated set at time t+1 will contain 4 models
(m^{t+1,1}_{j,l-1}, ..., m^{t+1,4}_{j,l-1}). The four models differ in their
initial beliefs.

The distribution over the updated set of models in the chance node
Mod[M_j^{t+1}] depends on the distributions over j's action and observation
that led to those models, and on the prior distribution over the models at
time step t. We refer the reader to [2] for more details about I-DIDs, owing
to the limited space here.
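The growth of the model node described above can be illustrated with a small counting sketch. It assumes, as a simplification not stated in the text, that each model prescribes a single optimal action, so that every (model, observation) pair yields exactly one updated model; the names `updated_model_ids`, `m1`, `o1` and so on are hypothetical.

```python
from itertools import product

def updated_model_ids(models, observations):
    """Enumerate the models at time t+1: one per (model at t, observation) pair."""
    return [f"{m}|{o}" for m, o in product(models, observations)]

models_t = ["m1", "m2"]        # two candidate models of agent j at time t
observations = ["o1", "o2"]    # agent j can receive two possible observations
models_t1 = updated_model_ids(models_t, observations)
models_t2 = updated_model_ids(models_t1, observations)
print(len(models_t1), len(models_t2))
```

Under this assumption the number of candidate models doubles at every step, which is the expansion that the object-oriented design in Section 3 is meant to keep manageable.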

The current design of I-DIDs, like that of most probabilistic graphical
models, is not rooted in the object-oriented paradigm. We believe that an
object-oriented conception would improve the current design and
implementation. Here it is necessary to cover some basic concepts. In the
object-oriented paradigm the basic component is an object, an instance of a
class. A class is a description of objects with common structures, behaviors
and attributes, and has an associated set of nodes, connected by links. In
addition to the usual nodes in probabilistic graphical models, a class may
also contain special nodes, called instance nodes, representing instances of
other classes. A class instance represents a network containing three sets
of nodes as defined in HUGIN: input nodes, output nodes and protected nodes.
Input nodes and output nodes are the class interfaces and are used to link
class instances to other network fragments. They must be decision or chance
nodes. A protected node is a node that only has parents and children inside
the class. It can be any kind of node.

Fig. 1. (a) A generic level l > 0 I-ID for agent i with a model node
(M_{j,l-1}) and the policy link represented by the dashed arrow. (b) Model
update from t to t+1. Mod[M_j] has the number of j's models as its values.
Notice the growth of models in the model node at t+1, shown in bold.

3  Interactive Time-Critical Dynamic Influence
Diagrams (I-TCDIDs)

I-TCDIDs extend I-DIDs by including the concepts of temporal arcs and time
sequences. Furthermore, three classes (agent class, time-slice class and
inference class) are defined in I-TCDIDs to avoid repetition of identical
structures. The object-oriented conception realizes the model's
self-expansion and self-compression for complex problems.

Each node in an I-TCDID represents a set of time-indexed variables. Arcs in
an I-TCDID are called temporal arcs and denote both probabilistic and
temporal (time-lag) relations among the variables. An I-TCDID allows nodes
with different temporal information to coexist in the same model.

Problems in dynamic multiagent domains are often repetitive in nature, for
example different agents of the same type and several time-slices.
Naturally, these repeated structures should be modeled using the agent class
and the time-slice class. As mentioned in Section 2, I-DIDs introduce a
specific model node representing other agents' models, and the models are
expanded over time. This becomes inflexible and redundant when many agents
are considered. I-TCDIDs address this gap by representing other agents'
models as the values of instances of the agent class (agent instances). The
time-slice class is a fragment that is repeated continuously, with links
between the slices representing different time intervals. As in [6], an
outer-most class, called the inference class in our work, should be defined
to provide additional information and perform inference. Instantiating the
inference class gives us the equivalent of a DID, which makes it possible to
use ordinary DID inference engines.

3.1  Agent  Class 

The agent class models the common structures, behaviors and attributes in
the domain. An instance of the agent class includes a set of agent instances
and some usual nodes to assist inference, and yields an optimal strategy as
its output. The instance node of a specific agent, say j, is shown in Fig. 2.
It interacts with its surroundings through an input node (an oval with a
heavy grey border) and an output node (an oval with grey filling). The
values of the input node S represent a set of current states, while the
output node holds a set of optimal actions of agent j.

The nodes m^1_{j,l-1}, ..., m^n_{j,l-1} are agent j's alternative
computational models ascribed by i at level l-1. Each computational model is
an instance of agent j at the lower level. Hence the agent class is defined
recursively. For several agents in an interacting environment, we could have
an agent instance for every agent.

Fig. 2. The detailed instance node of agent j with several computational
models (m^1_{j,l-1}, ..., m^n_{j,l-1}) instantiated from the agent class
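The recursive definition of the agent class can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `AgentInstance` and `build_agent`, the fixed number of models per level, and the omission of chance, decision and utility nodes are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentInstance:
    """Hypothetical container mirroring an agent instance node."""
    agent: str
    level: int
    states: List[str]                                   # input node S
    models: List["AgentInstance"] = field(default_factory=list)  # lower-level models

    def depth(self) -> int:
        """Number of nested modeling levels, counting this instance."""
        return 1 + max((m.depth() for m in self.models), default=0)

def build_agent(agent, other, level, states, n_models=2):
    """A level-l instance of `agent` holds level-(l-1) instances of `other`."""
    if level == 0:
        return AgentInstance(agent, 0, states)
    models = [build_agent(other, agent, level - 1, states, n_models)
              for _ in range(n_models)]
    return AgentInstance(agent, level, states, models)

agent_i = build_agent("i", "j", level=2, states=["s0", "s1"])
print(agent_i.depth(), [m.agent for m in agent_i.models])
```

Because each instance only stores instances of the same class, a deeper nesting level requires no new class definition, which is the point of the recursive design.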


3.2  Time-Slice  Class  and  Inference  Class 


The basic building block of an I-TCDID is a one-time-interval network
fragment of a specific domain. It is an instance of a class called the
time-slice class. Fig. 3 shows a time-slice class with four wanted time
intervals. The input nodes are place-holders for variables in the previous
time step, while the output nodes represent the corresponding variables at
the current time step. Solid arcs are instantaneous arcs and dashed arcs are
time-lag arcs that model relationships between nodes in consecutive
time-slices. For instance, the dashed arc between S^t and S^{t+1} represents
the physical states in the current time-slice influencing those of the next
time-slice.

The nodes C^{t}_{j,l-1} and C^{t+1}_{j,l-1} are not actual interface nodes.
The arcs coming from and going to the agent instance node are called
influential arcs; they only represent the influential relationships between
parent and child. The solid bold arc from C^{t}_{j,l-1} to C^{t+1}_{j,l-1}
is a new arc called the model update arc (the time-indexed model update link
[2]), reflecting updates of the models in the agent instance node between
two consecutive time-slices. The updated model node demands only the
place-holders S and instances of agent j's classes, e.g. m_{j,l-1} and so
on. The influential arc and the model update arc may both be replaced by
arcs between usual nodes.

Fig. 3. A generic level l time-slice class for agent i. Notice that the
model update arc, represented by the solid bold arc, denotes the update of
the models of j and of the distribution over the models over time.

Fig. 4. An agent instance node in which two models, m_j^1 and m_j^2, have
different sub-time sequences <1,3> and <1,2> respectively.

We now turn to the concepts of master-time sequence and sub-time sequence,
which are used to realize time abstraction. The master-time sequence
represents the wanted time intervals in the modeling process. A sub-time
sequence is a subset of the master-time sequence and is used to remove
unnecessary information at specific time steps. In Fig. 3, the agent
instance node C_{j,l-1} is indexed by the sub-time sequence <1,3> while the
others are indexed by the master-time sequence <1,2,3,4>. For simplicity, we
do not show the master-time sequence of the nodes.

Recall that the agent instance node contains all candidate models of other
agents. These models may themselves be agent instances, leading to recursive
modeling. They may be abstracted in different ways. This requires indexing
each model with its own time sequence in the agent instance node. Assume
that agent j has two candidate models, m_j^1 and m_j^2; we show one example
of an agent instance node with differently time-indexed models in Fig. 4. In
this case, model m_j^1 is indexed by the sub-time sequence <1,3> while m_j^2
is indexed by <1,2>. We may also index the instance node using a single time
sequence if all models share the same sequence. This is exactly the case in
Fig. 3, where C_{j,l-1} is time-indexed by <1,3> and all models have the
same time sequence <1,3>. In this case, agent j may be left out at time
steps <2,4> because of its negligible influence. Agent j may take actions
for fewer time steps and intervene only at the indexed times. This means
that agent j has been temporally abstracted by omitting its value at some
intermediate time indices.

Fig. 5 shows the deployment of the time-slice class described in Fig. 3.
We repeat a normal node (any node except the instance node) only if its time
sequence equals the master-time sequence; otherwise, it is cast into a
deterministic node (which is deterministically dependent on its parent
nodes) for each time step whose index is omitted from the time sequence. For
the agent instance node, we update a model only at the time steps indexed in
the time sequence of that model inside the model node. Otherwise, we retain
all models from the previous time step and do not perform any model update;
we also mark the instance node as a deterministic node. A model yields no
solutions (actions performed by agents) at a time step that is not indexed
in its time sequence. To facilitate setting the CPT of the action node A_j,
we assume a uniform distribution over actions from the model, e.g. assigning
the probability 1/|A_j| to the columns corresponding to the model.
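The deployment rule for the agent instance node can be sketched as a loop over the master-time sequence. The function name `deploy` and the record layout are hypothetical, and the sketch covers only the update-or-retain decision and the uniform action distribution, not the construction of the actual network.

```python
def deploy(master, sub, actions):
    """Unroll an agent instance node over the master-time sequence.

    For time steps in the sub-time sequence the models are updated; for the
    omitted steps the node becomes deterministic and a uniform distribution
    over the agent's actions is used to fill the CPT columns.
    """
    uniform = {a: 1.0 / len(actions) for a in actions}
    plan = []
    for t in master:
        if t in sub:
            plan.append({"t": t, "update": True, "node": "instance",
                         "action_dist": None})
        else:
            plan.append({"t": t, "update": False, "node": "deterministic",
                         "action_dist": dict(uniform)})
    return plan

plan = deploy(master=[1, 2, 3, 4], sub=[1, 3], actions=["a1", "a2", "a3"])
for step in plan:
    print(step["t"], step["node"])
```

With the master-time sequence <1,2,3,4> and the sub-time sequence <1,3> used in Fig. 3, steps 2 and 4 are the ones that fall back to the deterministic form.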

Instances of the time-slice class should be encapsulated by an outer-most
class, called the inference class here, to perform inference. The selected
initial information of the inference class (Fig. 6) can be input into the
variables of the time-slice class.

Fig. 5. The deployed form of the time-slice class with only two
time-slices. The instance node C_{j,l-1} is represented by a deterministic
node. The two C_j instances are computational models with different beliefs
and yet identical time indices.

Fig. 6. An inference class with the input information of the time-slice
class. The input node M_j represents the initial belief over agent j's
models. The optimal strategy can be obtained from the Strategy node.


4  Conclusion 

We propose a formal model, I-TCDIDs, to represent multiagent time-critical
dynamic decision problems. The new technique uses an object-oriented concept
to abstract the representation, especially for model expansion over time.
It defines instances of the inference and time-slice classes based on the
concept of the agent class. In future work, it would be interesting to study
the impact of initialization on the inference instance in I-TCDIDs.

Abstract. In sequence labelling, when the label of a token in the sequence
is changed, the output probabilities of the other tokens in the same
sequence also change. We propose a new active learning framework for
sequence labelling which takes this change of probability into account. At
each iteration of the proposed method, every time the human annotator
manually annotates a token, the output probabilities of the other tokens in
the sequence are re-estimated. The proposed method is expected to reduce the
amount of human annotation required for obtaining high labelling
performance. Through experiments on the NP chunking dataset provided by
CoNLL, we empirically show that the proposed method works well.

Keywords: active learning, sequence labelling, semi-supervised learning,
partial annotation, re-estimation.


1  Introduction 

Many natural language processing tasks, such as base NP chunking, named
entity recognition and semantic role labelling, can be regarded as sequence
labelling tasks. Sequence labelling assigns an output label to each token in
the given input sequence. The accuracy of sequence labelling depends on the
feature set design, the labelling algorithm, and also the quality of the
training set. In order to obtain good accuracy, we need a considerably large
amount of labelled data, which can only be obtained by expensive human
annotation. In order to reduce the amount of human annotation, active
learning has been proposed in [1]. In active learning for sequence
labelling, the system automatically selects informative, yet-unlabelled
training sequences and asks the human annotator to annotate them. Hence the
system can often achieve high accuracy with a relatively small amount of
human annotation work.

In sequence labelling, each output label in a sequence is predicted with a
different confidence. If the system is uncertain in predicting the label of
a token, we should manually annotate the token. On the other hand, we can
let the system automatically annotate the other tokens. This idea was
implemented by

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 681-686, 2010.
(c) Springer-Verlag Berlin Heidelberg 2010
Tomanek and Hahn [2]. In their method, if the marginal probability of the
predicted label of a token is low, this token is manually annotated, and the
labels of the other tokens with high probability remain unchanged. They
succeeded in reducing the required amount of human annotation. However, we
would like to point out that if a token is manually labelled, the
probability of its output changes, and this also affects the probability of
the labels of other tokens in its neighbourhood, since the labels are
usually dependent on each other in sequence labelling. If a changed
probability exceeds the confidence threshold, the system can automatically
annotate such tokens. Since their method labels all tokens with low
confidence at once, there is no chance to re-estimate the probability.
Therefore, some of the annotation effort may be wasted.

In this paper, we propose a new active learning framework for sequence
labelling. In the proposed algorithm, an informative token is selected by
the system according to the marginal probability. The output probabilities
of the other informative tokens in the sequence are then re-estimated by the
system. After a few iterations of annotation the model becomes certain in
predicting the output of all tokens in the sequence. Thus, the annotation
cost is smaller than when all informative tokens are labelled at once.

The rest of this paper is organized as follows. Section 2 discusses related
work. Section 3 describes the Conditional Random Fields (CRFs) algorithm,
which we use as the classifier for our system. We present our method in
detail in Section 4. Section 5 contains the experimental results and the
discussion. Finally, we conclude our work and discuss future work in
Section 6.

2  Related  Work 

Settles et al. [3] explored several fully-supervised active learning
settings for sequence labelling. In contrast, our work uses semi-supervised
learning, which requires less annotation effort than supervised learning.
Our work is most closely related to the semi-supervised active learning
proposed by Tomanek and Hahn [2]. The main difference between our method and
theirs is the probability re-estimation. Since they annotate all informative
tokens at once, there is no token with uncertain output left in the
sequence.

Culotta and McCallum [4] introduced a system which can reduce user effort on
structured prediction tasks by probability re-estimation. An annotator is
provided with a list of labelling candidates generated by the system, and is
asked to correct errors in a candidate starting from the least confident
one. After each correction, the probability of the labelling is
re-estimated. However, the annotator is required to verify all of the tokens
in a candidate. In contrast to their method, we automatically decide the
output for tokens with high confidence and only ask an annotator to label
tokens with low confidence.

3  Conditional  Random  Fields  (CRFs) 

The objective of the sequence labelling task is to find an output label
sequence $y = (y_1, \ldots, y_T) \in Y$ for the input sequence
$x = (x_1, \ldots, x_T) \in X$. $X$ and $Y$ are the sets of all possible
input and output sequences, respectively. $T$ is the length of a sequence.
We will learn the mapping $X \to Y$.

We adopt linear chain CRFs [5], which model the conditional probability of
output label sequence $y$ given input sequence $x$ as

$$P_\theta(y \mid x) = \frac{\exp(\theta \cdot \Phi(x, y))}{Z_{\theta,x,Y}} \qquad (1)$$

where $\Phi(x, y) : X \times Y \to \mathbb{R}^d$ is a function from a pair
of input sequence $x$ and output sequence $y$ to a feature vector of $d$
dimensions. $Z_{\theta,x,Y} = \sum_{y' \in Y} \exp(\theta \cdot \Phi(x, y'))$
is the normalizing factor, which can be computed efficiently using dynamic
programming. $\theta \in \mathbb{R}^d$ is a set of model parameters learned
from the labelled set by maximum likelihood estimation.
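To make Eq. (1) concrete, the following sketch evaluates it by brute force on a three-token toy input. The label set, the emission and transition weights, and the function names are invented for illustration; the normalizer is summed over every label sequence, which is only feasible for tiny inputs and is not the dynamic program a real CRF would use.

```python
import math
from itertools import product

LABELS = ["B", "I", "O"]

def score(x, y, emit, trans):
    """theta . Phi(x, y) for a toy model with emission and transition weights."""
    s = sum(emit.get((xt, yt), 0.0) for xt, yt in zip(x, y))
    s += sum(trans.get((a, b), 0.0) for a, b in zip(y, y[1:]))
    return s

def crf_prob(x, y, emit, trans):
    """P(y | x) = exp(score(x, y)) / Z, with Z summed over all label sequences."""
    z = sum(math.exp(score(x, yp, emit, trans))
            for yp in product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, emit, trans)) / z

# Made-up weights, chosen only so that one labelling clearly dominates.
emit = {("the", "B"): 2.0, ("cat", "I"): 1.5, ("sat", "O"): 2.0}
trans = {("B", "I"): 1.0, ("O", "I"): -2.0}
x = ["the", "cat", "sat"]

all_y = list(product(LABELS, repeat=len(x)))
total = sum(crf_prob(x, y, emit, trans) for y in all_y)
best = max(all_y, key=lambda y: crf_prob(x, y, emit, trans))
print(best, round(total, 6))
```

The probabilities of all 27 label sequences sum to one, which is exactly the role of the normalizing factor in Eq. (1).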


4  Active  Learning  for  Sequence  Labelling 


In active learning, new sequences in each iteration are chosen by a query
strategy. The query strategy returns either a sequence or a set of sequences
which are likely to be the most informative for training. Following Tomanek
and Hahn [2], we regard the sequence x with the lowest sequence probability
as the most informative sequence. We then select a set of the most
informative sequences from the unlabelled set in each iteration.

Subsequently, we divide the tokens in the selected set into informative and
uninformative tokens, based on the prediction confidence of the current
model. We define the confidence measure in our work using the marginal
probability, computed as follows:

$$P_\theta(y_j = y' \mid x) = \frac{\alpha_j(y' \mid x) \cdot \beta_j(y' \mid x)}{Z_{\theta,x,Y}} \qquad (2)$$

$\alpha_j(y' \mid x)$ is the forward score, which is the score of the prefix
sub-sequence of $x$ with the token at $j$ annotated with $y'$.
$\beta_j(y' \mid x)$ is the backward score, which is the score of the suffix
sub-sequence of $x$ with the token at $j$ annotated with $y'$. Since our
model is a linear chain CRF, $\alpha$ and $\beta$ are computed using an
algorithm similar to the forward-backward algorithm in standard hidden
Markov models [5]. When the confidence of a token is less than the
confidence threshold $\delta$, we regard the token as informative and a
human annotator will annotate that token. Other tokens with high confidence
are automatically annotated by the model. We iteratively annotate one token
at a time, starting from the least confident token, until there is no
informative token left in the sequence.
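The marginal in Eq. (2) can be sketched with a plain forward-backward pass. The potentials below are arbitrary positive numbers standing in for the exponentiated feature scores of a trained CRF, and the names `forward_backward`, `node` and `trans` are assumptions of this illustration. Here the forward score includes the potential at position j and the backward score excludes it, so their product divided by Z is the marginal.

```python
def forward_backward(node, trans):
    """Return per-position label marginals for a linear chain.

    node[j][y]  : non-negative potential of label y at position j
    trans[a][b] : non-negative potential of moving from label a to label b
    """
    T, K = len(node), len(node[0])
    alpha = [list(node[0])] + [[0.0] * K for _ in range(T - 1)]
    beta = [[1.0] * K for _ in range(T)]
    for j in range(1, T):                      # forward scores
        for b in range(K):
            alpha[j][b] = node[j][b] * sum(alpha[j - 1][a] * trans[a][b]
                                           for a in range(K))
    for j in range(T - 2, -1, -1):             # backward scores
        for a in range(K):
            beta[j][a] = sum(trans[a][b] * node[j + 1][b] * beta[j + 1][b]
                             for b in range(K))
    z = sum(alpha[T - 1])                      # normalizing factor
    return [[alpha[j][y] * beta[j][y] / z for y in range(K)] for j in range(T)]

node = [[2.0, 1.0], [1.0, 1.0], [1.0, 3.0]]    # toy potentials, two labels
trans = [[2.0, 1.0], [1.0, 2.0]]
marginals = forward_backward(node, trans)

delta = 0.7                                    # confidence threshold
informative = [j for j, m in enumerate(marginals) if max(m) < delta]
print(informative)
```

Tokens whose most probable label has a marginal below the threshold are the ones that would be handed to the annotator.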

Recall that a change in the output probability of one token will affect the
output probabilities of the other tokens in the same sequence. By labelling
a token, the probability of that label is implicitly set to 1.0 while the
probabilities of the other outputs of the same token are set to 0. We then
re-estimate the probability of each output label after each manual
annotation, before any re-training. After re-estimation, if the system
predicts the label of a token with a probability higher than the threshold
$\delta$, we assume that it is correctly predicted. We employ the
constrained Viterbi algorithm [4] for predicting the output and estimating
the output probability. The Viterbi decoding requires only a few
milliseconds and will not significantly affect the processing time of the
whole system. Finally, we add the newly annotated sequences to the training
set and start the next iteration. The learning ends after we have labelled
all unannotated sequences.
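The annotate-then-re-estimate loop for a single sequence can be sketched as follows. This is a simplified stand-in rather than the authors' implementation: it clamps an annotated token by zeroing the potentials of its other labels and recomputes marginals with forward-backward, instead of using the constrained Viterbi algorithm, and the potentials, the `gold` labels and the function names are hypothetical.

```python
def marginals(node, trans):
    """Forward-backward marginals for a linear chain with non-negative potentials."""
    T, K = len(node), len(node[0])
    alpha = [list(node[0])] + [[0.0] * K for _ in range(T - 1)]
    beta = [[1.0] * K for _ in range(T)]
    for j in range(1, T):
        for b in range(K):
            alpha[j][b] = node[j][b] * sum(alpha[j - 1][a] * trans[a][b]
                                           for a in range(K))
    for j in range(T - 2, -1, -1):
        for a in range(K):
            beta[j][a] = sum(trans[a][b] * node[j + 1][b] * beta[j + 1][b]
                             for b in range(K))
    z = sum(alpha[-1])
    return [[alpha[j][y] * beta[j][y] / z for y in range(K)] for j in range(T)]

def annotate_with_reestimation(node, trans, gold, delta):
    """Ask for one label at a time, clamp it, and re-estimate the rest."""
    node = [list(row) for row in node]          # work on a copy
    asked = []
    while True:
        m = marginals(node, trans)
        open_pos = [j for j in range(len(node))
                    if j not in asked and max(m[j]) < delta]
        if not open_pos:
            break
        j = min(open_pos, key=lambda p: max(m[p]))   # least confident token
        asked.append(j)
        # Clamp: only the annotated label keeps its potential (marginal -> 1).
        node[j] = [v if y == gold[j] else 0.0 for y, v in enumerate(node[j])]
    final = marginals(node, trans)
    labels = [max(range(len(row)), key=row.__getitem__) for row in final]
    return asked, labels

node = [[1.5, 1.0], [1.0, 1.0], [1.5, 1.0]]     # toy potentials, two labels
trans = [[4.0, 1.0], [1.0, 4.0]]                # neighbouring labels tend to agree
gold = [0, 0, 0]                                # labels the annotator would give
delta = 0.75

at_once = [j for j, m in enumerate(marginals(node, trans)) if max(m) < delta]
asked, labels = annotate_with_reestimation(node, trans, gold, delta)
print(len(at_once), len(asked), labels)
```

In this toy sequence all three tokens start below the threshold, so labelling them at once costs three annotations, whereas annotating only the middle token lifts both neighbours above the threshold after re-estimation.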


5  Experiments 

5.1  Data,  Pre-processing,  and  Evaluation 

We use the base NP chunking data from the CoNLL-2000 shared task. The output
labels are in IOB format [6]. Our feature set consists of unigram, bigram
and trigram features over words and part-of-speech tags. We choose the 50
longest sequences as our initial set, since long sequences are likely to
contain more information than short sequences. The number of new sequences
per iteration is fixed to 50 in all experiments.

The performance of each setting is evaluated by F1 versus the number of
manually annotated tokens. F1 is measured following the CoNLL evaluation
[6]. The significance of F1 improvement is measured by McNemar's test.
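As a reminder of how McNemar's test works, the sketch below computes the continuity-corrected statistic from the two disagreement counts of a pair of systems. The paper does not say which variant was used or how the counts were collected, so the function name and the example counts `b` and `c` are purely hypothetical.

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square test with continuity correction.

    b : items system A got right and system B got wrong
    c : items system A got wrong and system B got right
    Returns the statistic and its p-value (chi-square, 1 degree of freedom).
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2.0))   # survival function for 1 d.o.f.
    return chi2, p

chi2, p = mcnemar(b=30, c=12)              # made-up disagreement counts
print(round(chi2, 3), p < 0.05)
```

Only the items on which the two systems disagree enter the statistic, so two systems with similar F1 can still differ significantly, or not, depending on how those disagreements are split.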

5.2  Active  Learning  Settings 

We employ the CRFs described in Section 3 as the labelling model in all
settings. There are three baseline systems. The first baseline is
Supervised-initial, which is a supervised system using only the initial set
as training data. The second baseline is the Fully Supervised Active
Learning system (FuSAL), in which all tokens in each sequence are manually
annotated. The last baseline is the Semi-Supervised Active Learning system
(SeSAL) proposed by Tomanek and Hahn [2]. First, all high-confidence tokens,
whose output probabilities exceed the confidence threshold δ, are
automatically annotated by the current model. Subsequently, the
low-confidence tokens are manually annotated.

We propose the Semi-Supervised Active Learning with Probability
Re-Estimation system (SeSAL-ReEst). There are two main differences between
SeSAL and SeSAL-ReEst. The first is that a human annotator labels one
informative token at a time in SeSAL-ReEst but labels all informative tokens
at once in SeSAL. The second is that we also re-estimate the probability
after each annotation in SeSAL-ReEst.


5.3  Result 

Fig. 1 shows that SeSAL-ReEst achieves F1 similar to SeSAL with less
annotation cost. According to Table 1, we can reduce the annotation cost of
SeSAL by 3.61%, 18.01%, and 23.00% when δ = 0.60, 0.90 and 0.99,
respectively. Table 1 also shows the number of mis-labelled tokens in the
training data, which is
Table 1. F1 using all sequences, number of manually annotated tokens, and
number of mis-labelled tokens in the training set of each annotation setting

Setting              without re-estimation             with re-estimation
                     F1     %Err_train  %Tag_train     F1     %Err_train  %Tag_train
Supervised-initial   87.71  (6.10)      1.43           -      -           -
SeSAL-0.60           88.84  4.72        1.60           88.96  4.73        1.54
SeSAL-0.90           90.96  3.58        2.54           90.78  3.64        2.08
SeSAL-0.99           92.45  2.38        4.48           92.45  2.51        3.42
FuSAL                93.86  0.00        100.00         -      -           -

[Figure: three panels on CoNLL-2000 plotting F1 against the number of
manually annotated tokens: (a) δ = 0.60, (b) δ = 0.90, (c) δ = 0.99]

Fig. 1. The number of manually annotated tokens and F1 using SeSAL and
SeSAL-ReEst with confidence threshold δ = 0.60, 0.90, 0.99


not significantly different between SeSAL and SeSAL-ReEst. Since probability
re-estimation raises the marginal probability of yet-unlabelled tokens above
the confidence threshold while producing quite similar output labels,
SeSAL-ReEst requires less annotation cost than SeSAL but maintains
comparable F1.
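The size of the saving can be approximated from the %Tag columns printed in Table 1. The sketch below uses those rounded percentages, so it yields roughly 3.75%, 18.11% and 23.66%, close to but not identical with the reductions quoted in the text, which may have been derived from raw token counts; the helper name `relative_reduction` is an assumption of this illustration.

```python
# threshold: (%Tag for SeSAL, %Tag for SeSAL-ReEst), as printed in Table 1
tag_pct = {
    0.60: (1.60, 1.54),
    0.90: (2.54, 2.08),
    0.99: (4.48, 3.42),
}

def relative_reduction(before, after):
    """Percentage of annotation cost saved when going from `before` to `after`."""
    return 100.0 * (before - after) / before

reductions = {d: round(relative_reduction(b, a), 2)
              for d, (b, a) in tag_pct.items()}
for delta, saved in reductions.items():
    print(delta, saved)
```

The saving grows with the threshold: a stricter threshold marks more tokens as informative, which leaves more tokens for re-estimation to resolve automatically.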

With a low confidence threshold, many erroneous tokens are not recovered,
which prevents the system from achieving high F1. Table 1 also shows the
number of errors in the training set. With a higher threshold, there are
fewer errors in the training set, so we can achieve higher F1 than with a
low threshold, but at a higher annotation cost.


6  Conclusion  and  Future  Work 

Semi-supervised active learning can reduce the human annotation cost by
selectively labelling informative tokens. However, most of the informative
tokens are already correctly predicted. Annotation followed by
re-estimation automatically annotates these tokens without any human effort.
Hence, the proposed SeSAL-ReEst outperforms SeSAL in terms of the annotation
cost needed to achieve a certain level of F1.

The processing time of probability re-estimation per iteration is only a few
milliseconds. The time-consuming step is CRF training. Online learning,
which requires less time for model updating, may be more appropriate for the
active learning task. We leave the improvement of the training algorithm to
future work.

Moreover, even with a high confidence setting, we cannot achieve the
supervised F1, because of the many errors in automatically labelled tokens.
In other words, the current confidence measure does not succeed in selecting
mis-labelled tokens. We have to re-design the query strategy in order to
extract these mis-labelled tokens and have an annotator correct them.

Finally, we assume that the annotation difficulty of all tokens is the same.
In a real scenario, some tokens may be harder to label because of their
ambiguity in context. Our annotation cost should be re-defined to reflect
the annotation difficulty.

Abstract. The k nearest neighbors classifier is simple and often performs
well in practice. However, it does not work well on noisy and high
dimensional data, as the structure composed of the selected nearest
neighbors on such data is easily deformed and perceptually unstable. This
paper presents a locally centralizing samples approach with kernel
techniques to preprocess the data. It creates a new sample for each original
sample through its neighborhood and then replaces the original with it as a
candidate for nearest neighbors. This approach can be justified by gestalt
psychology and applied to provide better quality data for classifiers, even
if the original data is noisy and high dimensional. Experiments conducted on
challenging benchmark data sets validate the proposed approach.


1  Introduction 

It has been shown empirically that the k-nearest neighbors (KNN) classifier
is simple and often results in good classification performance [1], so many
variants have been proposed, such as new measures designed to select the
optimal nearest neighbors [1,2] and local mean classifiers (LMC) proposed to
resist outliers [3,4]. However, they depend heavily on the collection of
selected neighbors. The nearest neighbors selected on sparse, noisy, or
imbalanced data are easily deformed [10], which in turn leads to worse
performance [3]. This indicates that these classifiers usually depend on the
quality of the data that they operate on, so data preprocessing is necessary
to remove the noise and to fill in the missing values. In general,
eliminating noisy samples is a hard problem without any knowledge of the
data distribution. This paper proposes a locally centralizing samples (LCS)
approach that modifies noisy data into normal data instead of removing them,
and which is then applied to design enhanced classifiers.
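The basic idea of replacing a sample by a summary of its neighborhood can be sketched as below. This is a deliberately simplified stand-in: it uses a plain Euclidean local mean over the sample and its k nearest neighbors, not the kernel-based formulation the paper refers to, and the function name `locally_centralize` and the toy points are hypothetical.

```python
import math

def locally_centralize(samples, k):
    """Replace each sample by the mean of itself and its k nearest neighbours."""
    out = []
    for i, p in enumerate(samples):
        others = sorted((q for j, q in enumerate(samples) if j != i),
                        key=lambda q: math.dist(p, q))
        group = [p] + others[:k]
        out.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return out

# Four points on a unit square plus one outlying (noisy) point above it.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 3.0)]
centralized = locally_centralize(samples, k=2)
print(centralized[4])
```

The outlying point is pulled toward its neighbors rather than discarded, which matches the stated aim of modifying noisy samples instead of removing them.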

2  Locally  Centralizing  Samples  Approach 

All existing approaches to finding nearest neighbors depend heavily on some
carefully selected measures [1,2]. However, when the training data is noisy
or sparse, the neighbors selected by these measures often conflict with
human perception. In such cases, the geometric shape formed by the selected
neighbors is easily unstable, as shown in Fig. 1(B). When humans process
visual stimuli,
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Fig.  1.  Principle  of  visual  perceptual  laws  (A)  q  is  regarded  as  independent  object 
of  its  nearest  neighbors  (B)  q  is  taken  as  the  part  of  the  whole  graph,  but  it  is  not 
perceptual  stable  as  it  is  not  at  the  center  of  graph.  (C )q  is  taken  as  the  part  of  the 
whole  graph  visually  with  robust  stability. 


global  information  often  takes  precedence  over  local  information [8].  This  means 
that  we  should  measure  the  point  not  only  by  itself  but  by  its  neighborhood. 
Humans  routinely  classify  others  according  to  both  their  individual  attributes 
and  membership  in  higher  order  groups,  so  that  individual  attributes  may  be 
influenced  and  regulated  by  their  gronp[7].  Generally  noisy  data  is  only  small 
part  of  large  data  so  that  they  can  be  revised  to  normal  data  by  those  nor¬ 
mal  data.  According  to  Gestalt  psychology [6],  symmetry  is  an  imprecise  sense  of 
harmonious  and  balance  such  that  it  reflects  beauty  or  perfection.  Central  sym¬ 
metry  means  that  a  geometric  figure  is  called  a  symmetrical  relatively  a  center, 
if  all  points  are  around  the  center  point.  We  use  this  idea  to  locally  centralizing 
samples  through  its  nearest  neighbors,  and  then  replace  its  original  one  to  be 
candidate  for  nearest  neighbors.  In  this  way,  the  selected  nearest  neighbors  from 
locally  centralized  samples  can  be  more  consistent  with  our  perceptual  law,  so 
that  the  classification  can  be  performed  better.  This  can  be  illustrated  by  Fig.l, 
where  graph  B  is  not  stable  and  we  intend  to  move  the  query  q  to  the  center 
of  formed  neighbor  graph  to  remain  the  stability  as  the  graph  C  shown. Now  we 
give  an  algorithm  to  implement  the  LCS  in  the  context  of  classification  from 
statistics  using  Euclidean  distance[5], denoted  as  ELCS. 

ELCS(X, ℓ, r)

/* X is the set of training samples, ℓ(x_i) denotes the class of the sample x_i
in X, and r is the neighborhood size for locally centralizing samples */

Step 1. Select a sample from X, denoted as q.

Step 2. Apply the Euclidean distance d_e to find the r nearest neighbors of q
with the same class label, denoted as

$$\Omega(q, d_e, r) = \{ x_{\sigma(i)} \in X \mid d_e(q, x_{\sigma(i)}) \le d_e(q, x_{\sigma(i+1)}),\ 2 \le i \le r+1 \}$$

where σ is a permutation of the indices of the samples in X that orders them by
nondecreasing distance from q (x_{σ(1)} is q itself, since q ∈ X).

Step 3. Generate the new sample for q by

$$q_b = \frac{1}{r} \sum_{x \in \Omega(q, d_e, r)} x$$

Step 4. Repeat the above steps until all new samples are generated.
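The ELCS steps can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes that the new sample is the plain mean of the r nearest same-class neighbors and that q itself is excluded from its own neighborhood. The function names `euclidean` and `elcs` are hypothetical and do not come from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def elcs(samples, labels, r):
    """Return one locally centralized sample per original sample.

    Assumption: the new sample is the mean of the r nearest
    same-class neighbors of q, with q itself excluded.
    """
    new_samples = []
    for qi, q in enumerate(samples):
        # Candidate neighbors: same-class samples other than q.
        same = [x for j, x in enumerate(samples)
                if j != qi and labels[j] == labels[qi]]
        same.sort(key=lambda x: euclidean(q, x))
        neigh = same[:r]
        if not neigh:
            # No same-class neighbor available: keep q unchanged.
            new_samples.append(list(q))
            continue
        # Component-wise mean of the selected neighbors.
        new_samples.append([sum(col) / len(neigh) for col in zip(*neigh)])
    return new_samples


X = [[0, 0], [2, 0], [0, 2], [10, 10]]
y = [0, 0, 0, 1]
Xb = elcs(X, y, r=2)
```

The labels are carried over unchanged, so `Xb` can be paired with `y` and used in place of `X` as the candidate set for neighbor search.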

ELCS does not work well on nonlinear data. This can be solved by using a
kernel function k(x, y) to define a kernel distance d_k(x, y) [9]. In this way,
d_k can be applied to define the neighborhood for locally centralizing samples.
This approach is called the kernel locally centralizing samples (KLCS)
approach.
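As an illustration of a kernel distance, the sketch below uses the standard distance induced by a kernel in its feature space, d_k(x, y) = sqrt(k(x, x) - 2 k(x, y) + k(y, y)). Whether the paper uses exactly this form is an assumption, and the kernel choices, the default `gamma` and the function names are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Plain dot product."""
    return sum(xi * yi for xi, yi in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative default."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between x and y in the feature space of `kernel`.

    Assumption: the standard kernel-induced distance
    sqrt(k(x, x) - 2 k(x, y) + k(y, y)).
    """
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(sq, 0.0))


d_lin = kernel_distance([0, 0], [3, 4], kernel=linear_kernel)
d_rbf = kernel_distance([0, 0], [3, 4])
```

With the linear kernel this reduces to the Euclidean distance, which is a convenient sanity check; with the RBF kernel the distance is bounded by sqrt(2).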

3  Designed  New  Classifiers 

Theoretically, LCS acts as a smoother of the distribution of the training samples,
independent of the classifier used. Here we apply it to the KNN and LMC classifiers,
where LCS is applied only to the training samples, while the query sample is left
unchanged because its class label is not available.

ELCS-KNN(q, X, ℓ, r, k)

/* X is the training sample set, ℓ(x_i) denotes the class of the sample x_i
in X, r is the neighborhood size for ELCS, and k is the neighborhood size for
classification */

Step 1. Generate new samples from X by ELCS, denoted as X_b.

Step 2. Find the k nearest neighbors of q from X_b using the Euclidean distance,
denoted as Ω(q, k).

Step 3. Classify q into class ω_c if

$$c = \arg\max_{j = 1, \ldots, N_c} \{ n_j = |\{ x_i : x_i \in \Omega(q, k) \wedge \ell(x_i) = \omega_j \}| \}$$

where ω_j is the j-th class, N_c is the total number of classes, and |.| is the
cardinality of the set.
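The voting step can be sketched as follows. The sketch covers only the majority vote over the k nearest candidates; it assumes the candidate set passed in has already been centralized (for example by ELCS), and the function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter


def knn_classify(q, candidates, labels, k):
    """Majority vote among the k candidates nearest to the query q.

    `candidates` is assumed to be the centralized training set X_b;
    the query q is used as-is.
    """
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(q, candidates[i]))
    votes = Counter(labels[i] for i in order[:k])
    # On a tie, the label seen first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


Xb = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
yb = ["a", "a", "a", "b", "b"]
pred_1 = knn_classify([0.2, 0.2], Xb, yb, k=3)
pred_2 = knn_classify([5, 5.4], Xb, yb, k=3)
```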

KLCS-KNN(q, X, ℓ, r, k)

This  approach  is  the  same  as  ELCS-KNN  except  that  it  generates  new  samples 
from  X  by  KLCS  instead  of  by  ELCS. 

ELCS-LMC(q, X, ℓ, r, k)

/* X is the training sample set, ℓ(x_i) denotes the class of the sample x_i
in X, r is the neighborhood size for ELCS, and k is the neighborhood size for
classification */

Step 1. Generate new samples from X by ELCS, denoted as X_b.

Step 2. Select the k nearest neighbors of q from X^i ⊂ X_b, denoted as Ω(q, k, ω_i),
where X^i is the training sample subset from class ω_i.

Step 3. Compute the local mean vector y_i using the k nearest neighbors:

$$y_i = \frac{1}{k} \sum_{x \in \Omega(q, k, \omega_i)} x$$

Step 4. Classify q into class ω_c if

$$c = \arg\min_i \{ \| q - y_i \| \}$$
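A minimal sketch of the local mean classification step is given below. It assumes that the local mean of a class is the arithmetic mean of the k candidates of that class nearest to q, and that the candidate set has already been centralized; the function name `lmc_classify` is hypothetical.

```python
import math


def lmc_classify(q, candidates, labels, k):
    """Assign q to the class whose local mean vector is closest to q.

    Assumption: the local mean of a class is the mean of the k
    candidates of that class that are nearest to q.
    """
    best_label, best_dist = None, float("inf")
    for c in sorted(set(labels)):
        members = [x for x, y in zip(candidates, labels) if y == c]
        members.sort(key=lambda x: math.dist(q, x))
        neigh = members[:k]
        # Component-wise mean of the class-wise nearest neighbors.
        mean = [sum(col) / len(neigh) for col in zip(*neigh)]
        d = math.dist(q, mean)
        if d < best_dist:
            best_label, best_dist = c, d
    return best_label


Xb = [[0, 0], [0, 2], [4, 0], [4, 2]]
yb = [0, 0, 1, 1]
pred_left = lmc_classify([1, 1], Xb, yb, k=2)
pred_right = lmc_classify([3, 1], Xb, yb, k=2)
```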

KLCS-LMC(q, X, ℓ, r, k)

This approach is the same as ELCS-LMC except that it generates new samples
from X by KLCS instead of by ELCS.

4  Experimental  Results 

4.1  Experimental  Setup 

In order to validate the LCS approaches, we conducted extensive experiments with
the classifiers on benchmark artificial data and real data. The error rate is taken
as the performance measure of all compared classifiers [5,3,4]. In the experiments,
k takes values from {3, 6, ..., 30}, and the parameter r for LCS takes values from
{1, ..., 9}. The kernel function is chosen among the linear, poly, rbf, and sigmoid
kernel functions, while the kernel parameters are taken from
{0.1, ..., 0.9, 1, ..., 9}. For classification, each data set is divided into a
training set and a testing set according to the 'ModApte' split [11]. Ten such
partitions are generated randomly for the experiments. On each partition, the
compared classifiers are trained and tested for each pair of parameters, and
the best performance is reported.

4.2  On  Artificial  Data  Sets 

Using artificial data, we can control the number of available samples and add
noise according to the experimental purpose. To compare the six classifiers in the
noisy case, we perform experiments on the two-spiral pattern data [13] and the ring
norm data set [12] with 200 points, adding random Gaussian noise whose mean is 0
and whose variance is 0.0, 0.05, 0.1, ..., 0.45, respectively. It can be observed
from Table 1 that on the two noisy data sets, the classifiers enhanced by LCS
perform better than the original ones in terms of average accuracy and standard
deviation. This means that the classifiers with LCS are more resistant to noise.
To validate the ability of the proposed LCS to deal with high dimensional data, we
run experiments on the ring norm data set [12] and the p-dimensional norm data [4],
as they can be generated with different dimensions. It can be observed from Table 1
that the classifiers with LCS are more robust to the dimensionality and show
favorable behavior in high dimensions.

Table 1. Accuracies of classifiers on noisy and high dimensional artificial data sets (%)

Data           KNN           ELCS-KNN      KLCS-KNN      LMC           ELCS-LMC      KLCS-LMC
spiral(noise)  82.03±15.00   84.25±13.35   85.08±12.17   79.98±14.36   83.52±12.92   84.03±12.27
ring(noise)    59.1 ii 1.33  80.12±2.64    85.10±2.42    83.36±4.06    84.78±3.66    86.58±3.33
p-norm(dim)    55.53±8.59    79.69±2.88    81.67±3.00    83.61±4.62    85.06±3.91    87.03±2.76
ring(dim)      44.91±24.97   88.90±2.99    92.12±2.23    92.17±4.06    93.62±3.76    94.96±3.65

4.3 Experiments on Real Data Sets

To be practical, we also perform experiments on benchmark real data sets from the
UCI Repository of machine learning databases [15], where the records with missing
values and the non-numeric attributes are all removed. It can be observed from
Table 2 that the classifiers enhanced by LCS outperform KNN and LMC on average
accuracy. These results indicate the value of the proposed idea and classifiers.
They also reinforce the idea, previously justified by relative transformation [14],
that Gestalt laws can be geometrically modeled and then applied to perform
classification better.


Table 2. Accuracies of six classifiers on real data sets (%)

Data          KNN          ELCS-KNN     KLCS-KNN     LMC          ELCS-LMC     KLCS-LMC
wine          76.35±5.21   78.85±5.21   80.58±5.40   78.27±5.13   80.96±6.04   82.12±5.59
dermatology   89.15±2.60   91.42±2.37   91.51±2.31   93.30±2.57   93.40±2.52   93.58±2.59
diabetes      75.35±2.11   76.09±2.18   77.13±1.25   74.96±2.04   76.26±2.01   77.04±1.61
ionosphere    86.06±1.83   93.56±2.36   94.23±2.22   91.06±3.01   93.65±2.32   95.58±2.09
glass         69.84±6.43   72.46±5.72   74.43±5.31   71.80±5.40   73.61±6.01   74.75±5.31
optdigits     98.75±0.33   98.97±0.31   98.97±0.31   99.10±0.47   99.21±0.41   99.21±0.41
segmentation  82.54±3.17   84.29±2.94   85.71±3.51   83.02±3.09   85.24±3.82   85.87±4.26
yeast         59.75±2.15   60.34±2.41   60.32±2.45   58.78±2.33   59.64±2.28   59.66±2.12
yaleface      65.33±3.18   71.56±1.16   72.89±3.28   66.44±4.12   70.44±4.57   72.67±3.48
iris          97.33±0.94   98.44±1.83   99.78±0.70   97.33±2.04   97.78±1.48   99.56±0.94
avg           80.04±2.79   82.59±2.04   83.55±2.67   81.40±3.02   83.01±3.11   84.00±2.78

5  Conclusion and Future Work

This paper presents a locally centralizing samples approach that can effectively
modify noisy data into normal data instead of removing them. This approach
also makes the boundaries of classes more separable, so that the imbalance
problem can be addressed. The approach is justified by Gestalt psychology, which
means that the formed geometry of the data should be as regular and symmetric as
possible [10]. One way of implementing it is the bootstrap approach from
statistics [5]. However, that approach has only been applied to design the nearest
neighbor classifier rather than the k nearest neighbors classifier. We applied LCS
to design several classifiers that work better even if the original data is noisy
or high dimensional. In the future, more techniques will be applied to improve LCS,
and LCS will then be applied to advanced classifiers such as the support vector
machine.
Abstract.  Biped Robot with Heterogeneous Legs (BRHL) is a novel
robot model, which consists of an artificial leg and an intelligent bionic
leg. The artificial leg is used to simulate the amputee's healthy leg and
the intelligent bionic leg works as the intelligent artificial limb. This
paper discusses how a BRHL robot imitates a person's walking from the
points of gait identification, gait generation and gait control. Simulation
and practical system experiments support the validity of the presented
plan and proposed algorithm. This robot's design provides an excellent
platform for research on intelligent prosthetic legs.

Keywords:  Intelligent  bionic  leg;  Gait  planning;  Intelligent  prosthesis. 


1  Introduction 

An intelligent bionic leg is used to replace the malformed limb of an amputee in the
domain of healing biomedicine. Research on intelligent prostheses needs a large
number of varied experiments, but amputees cannot afford so many repeated
experiments, so the progress of intelligent prostheses is undoubtedly affected. The
proposed Biped Robot with Heterogeneous Legs (BRHL) [1] consists of an artificial
leg and an intelligent bionic leg, as shown in Fig. 1(a). The artificial leg is
used to simulate the amputee's healthy leg and the intelligent bionic leg works
as the intelligent artificial limb.

The artificial leg has six degrees of freedom (DOFs); its joints are active
joints driven by motors and linked with rigid bodies. The knee joint has a
multi-bar closed-chain structure and is a semi-active joint. A biped robot is a
naturally unstable system. In order to dynamically simulate the situation in which
amputees walk with an intelligent prosthesis (IP), an assistant quadricycle system
is designed to keep the robot walking stably. The whole BRHL system is shown in
Fig. 1(b).

This paper discusses how a BRHL robot imitates a person's walking from the
points of gait identification, gait generation and gait control. Section 2 describes
gait identification and planning of the bionic leg. Gait simulation as well as united
control simulation of BRHL is conducted in Section 3. Based on these simulation
experiments, a practical BRHL prototype is built and the experiment results are
depicted in Section 4. Conclusions and prospects are drawn in Section 5.

Fig. 1. (a) Simplified BRHL virtual prototype (b) BRHL experiment system

2  Gait  Identification  and  Planning  of  Bionic  Leg 

The common methods of biped robot gait planning include the method based
on gait data of the human body [2]; the methods based on the calculation of dynamics
and kinematics [3][4]; the method based on artificial neural networks and
genetic algorithms [5]; and the methods based on the Central Pattern Generator [6][7].

In contrast to the common biped walking robot, and according to the BRHL's
characteristics, the artificial leg's gait is obtained by artificial gait planning,
and the bionic leg's gait is designed to follow the artificial leg's motion.


2.1  Gait  Identification  with  Process  Neural  Network 

Gaits differ greatly across terrains, and each joint provides a different
torque. Five terrains are chosen here: flat, up-slope, down-slope, upstairs and
downstairs.

Ground Reaction Force (GRF) in different terrains is used for gait identification
[8]. A 6D force sensor in the ankle joint of the intelligent bionic leg is used to
measure three forces and three torques in three directions. Suitable gait data are
then looked up in the gait database according to the terrain and used to control
the damper output force of the bionic leg's knee joint so that it follows the
artificial leg. If the information from the two legs is not symmetric, the
artificial leg needs to be regulated.

A process neural network is adopted for gait identification. The output layer of
the process neural network completes the spatial weighted aggregation of the latent
signals and the time aggregation computation. Suppose {b_l(t)} is a group of base
functions of the input space C[0, T] of the process neural network; then the weight
functions can be expressed as finite combinations of the base functions.

Suppose the system input is X(t) = (x_1(t), x_2(t), ..., x_n(t)).
Then the system output is:

$$y = \sum_{i=1}^{m} v_i f\left( \sum_{j=1}^{n} \int_0^T \left( \sum_{l=1}^{L} w_{ji}^{(l)} b_l(t) \right) x_j(t)\,dt - \theta_i \right) \quad (1)$$

$$= \sum_{i=1}^{m} v_i f\left( \int_0^T \sum_{j=1}^{n} \omega_{ji}(t)\, x_j(t)\,dt - \theta_i \right) \quad (2)$$

where

$$\omega_{ji}(t) = \sum_{l=1}^{L} w_{ji}^{(l)} b_l(t) \quad (3)$$

The network error function is:

$$E = \sum_{k=1}^{K} (y_k - d_k)^2 \quad (4)$$

$$= \sum_{k=1}^{K} \left( \sum_{i=1}^{m} v_i f\left( \int_0^T \sum_{j=1}^{n} \omega_{ji}(t)\, x_j^k(t)\,dt - \theta_i \right) - d_k \right)^2 \quad (5)$$

With the gradient descent method, the learning rules for the network weights are:

$$v_i = v_i + \alpha \Delta v_i \quad (6)$$

$$w_{ji}^{(l)} = w_{ji}^{(l)} + \beta \Delta w_{ji}^{(l)} \quad (7)$$

$$\theta_i = \theta_i + \gamma \Delta \theta_i \quad (8)$$

where v_i is the connecting weight between the latent layer and the output layer,
w_ji^(l) is the connecting weight between node j of the input layer and node i of
the latent layer, θ_i is the output threshold value of the latent layer, d_k is the
desired output for input sample k, α, β, γ are the learning rates, b_l(t) is a base
function, n is the number of input nodes, m is the number of latent nodes, K is the
number of divisions in [0, T], and L is the number of base functions.
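The forward pass of such a process neural network can be sketched numerically. The sketch assumes that each hidden node integrates, over [0, T], the inputs weighted by weight functions expanded on the base functions, subtracts a threshold and applies an activation f; the sigmoid activation, the trapezoidal integration and all names below are illustrative assumptions rather than details given in the paper.

```python
import math


def pnn_output(x_funcs, basis, w, theta, v, T, steps=200):
    """Forward pass of a small process neural network.

    x_funcs: input functions x_j(t); basis: base functions b_l(t);
    w[j][i][l]: expansion coefficients of the weight function between
    input node j and hidden node i; theta[i]: hidden thresholds;
    v[i]: hidden-to-output weights.
    Assumptions: sigmoid activation, trapezoidal rule on [0, T].
    """
    def f(u):
        return 1.0 / (1.0 + math.exp(-u))

    h = T / steps
    y = 0.0
    for i in range(len(v)):
        total = 0.0
        for s in range(steps + 1):
            t = s * h
            # Weighted input at time t, summed over input nodes j.
            g = sum(
                sum(w[j][i][l] * basis[l](t) for l in range(len(basis)))
                * x_funcs[j](t)
                for j in range(len(x_funcs))
            )
            # Trapezoidal rule: half weight at both end points.
            total += g * (0.5 if s in (0, steps) else 1.0)
        y += v[i] * f(total * h - theta[i])
    return y


# One constant input, one constant base function, one hidden node.
y_demo = pnn_output([lambda t: 1.0], [lambda t: 1.0],
                    w=[[[2.0]]], theta=[2.0], v=[3.0], T=1.0)
```

In this toy case the integral equals 2, the threshold cancels it, and the sigmoid of 0 is 0.5, so the output is 3 × 0.5.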

2.2  Gait  Planning  of  Bionic  Leg 

The motion track of the bionic leg's hip joint can be solved directly from that of
the artificial leg:

o'm  =  m)  + f  (9) 

The  knee  joint's  ideal  motion  track  of  bionic  leg  can  also  be  solved  by  that  of 
artificial  leg  directly: 

r<J0l(t)  =  0*(t)  +  |  (]()) 

Because the control system can only provide a limited driving force, the actual
control input force or torque is constrained in the controlled object model,
especially in the dynamic model. Therefore, the state space tracks that the system
can realize form only a subset of the whole phase space. If the required track S
belongs to the attainable track space R, that is S ∈ R, then the ideal control
law can be obtained. Otherwise, an optimized control curve exists that makes the
practical track as close as possible to the required track.

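Because the MR damper of the bionic leg provides only limited driving, it may not
realize the ideal motion track $\mathrm{ref}\,\theta_3^b(t)$. To solve for the
optimized control law $U^*(t)$, $U^*(t) \in U_{ad}$ ($U_{ad}$ is the allowed control
set), and make the knee joint motion $\theta_3^b(t)$ of the bionic leg follow
$\mathrm{ref}\,\theta_3^b(t)$, the quadratic optimized performance index function is:

$$\min J(U) = \int_0^T \left( \delta\theta^{\top} \delta\theta + \delta\dot\theta^{\top} \delta\dot\theta \right) dt \quad (11)$$

where

$$\delta\theta = \mathrm{ref}\,\theta_3^b(t) - \theta_3^b(t) \quad (12)$$

$$\delta\dot\theta = \mathrm{ref}\,\dot\theta_3^b(t) - \dot\theta_3^b(t) \quad (13)$$

The optimized control vector is

$$U = \{T_b, I\}^{\top} \quad (14)$$

where $T_b$ is the control torque of the bionic leg's knee joint and $I$ is the
control current of the damper.

The damper force F provided by the damper is related to $\theta_3^b$, $\dot\theta_3^b$
and the input current of the damper. The constraint relationship is

$$F = f(\theta_3^b(t), \dot\theta_3^b(t), I) \quad (15)$$

In addition, there are the initial condition constraints:

$$\delta\theta(t_0) = 0, \quad \delta\dot\theta(t_0) = 0 \quad (16)$$

The damper current constraint is:

$$0 \le I \le 2 \quad (17)$$

In the practical calculation, discretization of the continuous system is needed:

$$\min J(U(\cdot)) = \sum_{i=1}^{n} \left( \delta\theta(i)^{\top} \delta\theta(i) + \delta\dot\theta(i)^{\top} \delta\dot\theta(i) \right) \quad (18)$$

$$U = \{T_b(i), I(i)\}^{\top} \quad (19)$$

$$\delta\theta(i) = \mathrm{ref}\,\theta_3^b(i) - \theta_3^b(i) \quad (20)$$

$$\delta\dot\theta(i) = \mathrm{ref}\,\dot\theta_3^b(i) - \dot\theta_3^b(i) \quad (21)$$

The solved $\theta_3^{b*}(t)$ corresponding to $U^*(t)$ is called the optimized
track or the extremal curve.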
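The discretized performance index can be evaluated with a few lines of Python. The sketch below assumes scalar knee angles and angular rates sampled at the same instants, and treats the damper current limit as a simple clamp to [0, 2]; the function names are hypothetical and the optimization itself (the search for the control sequence) is not shown.

```python
def tracking_cost(ref_angle, angle, ref_rate, rate):
    """Discrete tracking cost: sum of squared angle and rate errors.

    All four arguments are equal-length sequences sampled at the same
    time steps (scalar knee angle and angular rate, by assumption).
    """
    cost = 0.0
    for ra, a, rr, r in zip(ref_angle, angle, ref_rate, rate):
        cost += (ra - a) ** 2 + (rr - r) ** 2
    return cost


def clamp_current(current, low=0.0, high=2.0):
    """Keep a candidate damper current inside the allowed range."""
    return min(max(current, low), high)


J = tracking_cost(ref_angle=[1.0, 2.0], angle=[1.0, 1.0],
                  ref_rate=[0.0, 0.0], rate=[0.0, 2.0])
```

A candidate control sequence would be scored by simulating the knee joint under the clamped currents and passing the resulting track to `tracking_cost`.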
Table 1. Sample data in five terrains

Percent   GRF from vertical direction (KN)
number    upstairs   up-slope   downstairs   down-slope   flat
1         0.0555     0.0453     0.0616       0.0497       0.0530
2         0.1043     0.0878     0.1183       0.0953       0.1002
3         0.1482     0.1286     0.1714       0.1379       0.1433
4         0.1890     0.1692     0.2225       0.1794       0.1840
5         0.2285     0.2101     0.2713       0.2198       0.2239
6         0.2686     0.2531     0.3206       0.2616       0.2647
7         0.3111     0.2994     0.3711       0.3059       0.3082
8         0.30821    0.3500     0.4240       0.3540       0.3562
9         0.4102     0.4055     0.4800       0.4066       0.4090
...
99        0.0207     0.0199     0.0216       0.0203       0.0205
100       0.0080     0.0080     0.0080       0.0080       0.0080

Fig. 2. Gait Tracking Optimization of Knee Joint


3  Gait  Identification  and  Planning  of  Bionic  Leg 

The 6D force sensor in the ankle joint of the intelligent bionic leg can measure the
forces along the x, y and z axes as well as three torques (M_x, M_y, M_z). Table 1
gives sample data in five terrains.

The simulation example of the optimized gait following is shown in Fig. 2.
The dashed line stands for the ideal track of the bionic leg's knee joint and the
solid line stands for the result of gait following.

4  Implementation  of  Practical  System  Experiments 

To validate the control scheme of the bionic leg's knee joint and the humanoid
performance of the gait in the swinging phase, swinging and walking experiments are
conducted under the planned gait. The motion tracks of the knee joints of the
artificial leg and the intelligent bionic leg are shown in Fig. 3(a).

It can be seen that there are many inflexion points in the artificial leg's curve,
while the track of the knee joint of the intelligent bionic leg is smooth because
of the damper on it. The practical experiment result of the knee joint can only
partly follow the ideal gait.

The swinging phase experiments are shown in Fig. 3(b). The result indicates that
there is still a large error between the practical gait and the desired gait.

Fig. 3. (a) The track of knee joint of artificial leg and intelligent bionic leg;
(b) Swinging phase experiments of knee joint of intelligent bionic leg

5  Conclusion 

BRHL is an integration of a common biped robot and an intelligent prosthesis. It can
simulate well the situation in which a human walks with an IP. The united simulation
of the two-leg walking gait and the swinging phase of the artificial leg indicates
that the simulation platform approximates the practical system. A practical BRHL
system is built, and practical experiments of the swinging phase and walking are
conducted. The practical control experiment results are presented.
Abstract.  Alzheimer's disease (AD) is a progressive neurodegenerative disorder.
In AD-related research, the volumetric analysis of the hippocampus is the
most extensively studied approach. However, the segmentation and identification of
the hippocampus are highly complicated and time-consuming. Therefore, an MRI-based
classification framework is proposed to differentiate between AD patients and
normal individuals. First, volumetric features and shape features were extracted
from MRI data. Afterward, principal component analysis (PCA) was utilized to
decrease the dimensions of the feature space. Finally, a back-propagation artificial
neural network (ANN) classifier was trained for AD classification. With the
proposed framework, the classification accuracy reached 88.27% using only
volumetric features and shape features, and reached up to
92.17% using volumetric features and shape features with PCA.

Keywords:  Alzheimer's disease, magnetic resonance imaging, shape descriptors,
Artificial Neural Network, Principal component analysis.


1  Introduction 

Alzheimer's disease (AD) is a progressive neurodegenerative disorder. At present,
AD affects approximately 26 million people worldwide, and this number may
increase fourfold by 2050.

Diagnostic criteria for AD are currently based on clinical and psychometric
assessment. The main procedures for the evaluation of probable AD patients are
neuropsychological tests. In clinical practice, magnetic resonance imaging (MRI) is
a very important tool in diagnosing AD because it can qualitatively measure the
neuronal loss through the shrinkage of the structures of interest more easily.
Consequently, MRI has demonstrated that volumetric atrophy appears in the early
stages of AD [1].

In addition, the enlargement of the ventricles is also a significant characteristic
of AD due to neuronal loss [2]. Ventricles are filled with cerebro-spinal fluid (CSF)
and surrounded by gray matter (GM) and white matter (WM). As a result, when measured
by the ventricular enlargement, the hemispheric atrophy rate shows a higher
correlation with disease progression than the medial temporal lobe atrophy rates,
and reveals significant variation between normal individuals and AD patients.

In this study, an MRI-based classification framework is proposed to distinguish
AD patients from normal individuals. Section 2 explains the proposed framework,
comprising the system flowchart and the selected shape features. Statistical
analysis and experimental results are described in Section 3. Finally, the
conclusion is given in Section 4.

2  Flow  Chart  and  Feature  Extraction 

Figure  1  illustrates  the  flowchart  of  the  proposed  image-aided  AD  diagnosis  system. 
Details  of  each  step  are  explained  in  the  following. 


Fig.  1.  Flowchart  of  the  proposed  image-aided  AD  diagnosis  system 


2.1  Spatial  Normalization  of  MRI  Data 

Spatial normalization is a procedure to register a set of MRI data to a standard
spatial coordinate system, also known as the Talairach and Tournoux coordinate
system [3]. As a result, each voxel in the MRI data can be compared with the voxel
at the same position in other registered MRI data or in a reference MRI template.
In this study, all of the 3-D MRI sets were normalized to the ICBM MRI template by
using an optimum 12-parameter affine transformation and a Bayesian framework.

2.2  Volume  Features 

The volumes of GM, WM and CSF carry important information, especially in
brain degeneration diseases [4]. Hence, a clustering-based segmentation algorithm is
adopted to extract GM, WM and CSF probability maps from the source MRI data.
The value of each pixel in the corresponding probability map denotes the posterior
probability that the pixel belongs to the tissue given its gray intensity. The
volumes of GM, WM, CSF and the whole brain are obtained by the following equations:

$$\mathrm{volume}_{GM} = \sum_{\forall i \in I} \left( P(C_{GM} \mid f(i)) > 0.5 \right) \quad (1)$$

$$\mathrm{volume}_{WM} = \sum_{\forall i \in I} \left( P(C_{WM} \mid f(i)) > 0.5 \right) \quad (2)$$

$$\mathrm{volume}_{CSF} = \sum_{\forall i \in I} \left( P(C_{CSF} \mid f(i)) > 0.5 \right) \quad (3)$$

$$\mathrm{volume} = \sum_{\forall i \in I} \left( P(C_{brain} \mid f(i)) > 0.5 \right) \quad (4)$$

where i is any pixel of the MRI data I and f(i) stands for the gray level of i.
Figure 2 illustrates the segmentation results of the normal individual and AD patient.
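The volume computation amounts to counting the pixels whose posterior probability for a tissue exceeds 0.5. A minimal sketch, assuming the probability map is available as a flat sequence of posteriors; the `voxel_volume` scaling factor and the function name are hypothetical additions for illustration.

```python
def tissue_volume(prob_map, threshold=0.5, voxel_volume=1.0):
    """Count voxels whose posterior exceeds the threshold.

    prob_map: flat iterable of posteriors P(C_tissue | f(i)).
    voxel_volume: hypothetical scale factor (volume of one voxel);
    with the default of 1.0 the result is a plain voxel count.
    """
    count = sum(1 for p in prob_map if p > threshold)
    return count * voxel_volume


gm_posteriors = [0.9, 0.5, 0.51, 0.1]
gm_volume = tissue_volume(gm_posteriors)
```

Note that a posterior of exactly 0.5 is not counted, matching the strict inequality.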


Fig. 2. Segmentation results of the normal individual and AD patient (panels: GM, WM, CSF)

Binary ventricle volume data, M(x, y, z), are extracted from the MR images using a
region growing algorithm and a threshold found through the double threshold
algorithm [5]. After the thresholding, the binary ventricle regions are obtained
using the fill, erosion and dilation methods. The edges of the binary images are
detected by using the Sobel operator on a slice-by-slice basis. The segmented
region then forms a mask image, where 1 stands for a ventricle pixel and 0
stands for a non-ventricle pixel. Lastly, Eq. (5) is used to measure the cerebral
ventricle, as shown in Figure 3 (a) and (b), where i is any pixel of the mask data,
M is the mask image and f(i) denotes the gray level of i.

$$\mathrm{Volume}_{Ventricle} = \sum_{\forall i \in M} \left( f(i) = 1 \right) \quad (5)$$

Fig. 3. (a) CSF binary map, (b) ventricle mask image, and (c) edge of ventricle mask image

2.3  Shape  Features 

In contrast to the volume features, which are extracted from the whole
three-dimensional volume, the local shape features, such as area, distances between
salient points and symmetry, are obtained from a single 2-dimensional slice [6].

For the 3-D shape feature, we use the leave-one-out method to construct the training
set and the testing set. We then build two probability maps using Eq. (6) and Eq.
(7) for the normal controls and the patients in the training set, as shown in
Figure 4 (a) and (b), where M is the number of normal controls, N is the number of
AD patients and I represents the grey value of the ventricular mask image.

$$P_{Normal}(x, y, z) = \frac{1}{M} \sum_{m=1}^{M} I_{Normal}^{m}(x, y, z) \quad (6)$$

$$P_{AD}(x, y, z) = \frac{1}{N} \sum_{n=1}^{N} I_{AD}^{n}(x, y, z) \quad (7)$$

Next, we obtain the discriminate map by subtracting the normal probability map
from the AD probability map, as shown in Figure 4 (c). Finally, the matching
coefficient (MC) between the discriminate map and the testing mask is calculated
using Eq. (8). Here, D(x, y, z) is the discriminate map and T stands for the
testing ventricular mask image.

$$MC_{Normal\ or\ AD} = \sum_{x, y, z} D(x, y, z)\, T_{Normal\ or\ AD}(x, y, z) \quad (8)$$

Fig. 4. (a) Probability of the normal controls, (b) probability of the AD patients
and (c) discriminate map


In this approach, the 2-D shape features used include (1) Area, (2) Perimeter,
(3) Compactness, (4) Elongation, (5) Rectangularity, (6) Distances, (7) Minimum
thickness, and (8) Mean signature value.

2.4  Back-Propagation  Artificial  Neural  Network  Architecture 

In this approach, a three-layer BP-ANN is employed for the classification task. The
input layer contains 20 neurons, and the output layer has one neuron. The hidden
layer is composed of 17 neurons [7]. The maximum number of iterations is set to
5000 epochs, and the target output error on the validation set is less than 0.01.
The output value is within the range (0.0-1.0). A threshold (0.5 in our case) is
applied to classify each individual. If the output value is less than the
threshold, the subject is assigned to the probable AD group; otherwise, the subject
is assigned to the normal control group. The neural classifier is trained 10 times
to get reliable results. Thirty subjects (AD = 12, Normal = 18) are randomly
selected for the training set.


3  Experimental  Results 

3.1  Material 

The  whole  dataset  consists  of  two  groups:  24  patients  of  probable  Alzheimer’s  disease 
and  28  normal  controls  of  comparable  age.  Twenty-eight  individuals  are  normal 


Compuier-Aided  Diagnosis  of  AD  Using  Multiple  Features  with  ANN 




controls (18 males, 10 females) with a mean age of 67±5.67 years and an education
time of 10±4.8 years. Their average MMSE score was 28±1.24. Twenty-four individuals
were diagnosed as probable AD patients (11 males, 13 females) with a mean age of
71±7.37 years and an education time of 6.96±5.84 years. All patients were assessed
with the MMSE, complemented by verbal memory, figurative memory and visuospatial
tests. Their average MMSE score was 14.38±6.55.

3.2  Statistical  Analysis  and  Classification 

The Mann-Whitney U test was performed on each feature to evaluate its discriminative
power. The p-values obtained from the test provide a generally known and comparable
criterion. The test rejects the null hypothesis of equal distributions when p < 0.05.
Table 1 shows the statistical results for the volume and shape features. In the
experiment, the circularity and rectangularity are rejected (p > 0.05) and excluded
from the following classification steps.
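As a rough illustration of the per-feature test, the sketch below computes the Mann-Whitney U statistic with average ranks for ties and a two-sided p-value from the normal approximation. It is a simplified version under stated assumptions: no tie or continuity correction is applied, the text does not say how the p-values in Table 1 were computed, and the sample values in the usage note are made up rather than taken from the dataset.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic of sample x, two-sided p-value).

    Uses average ranks for tied values and a plain normal approximation
    (no tie correction, no continuity correction).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j to the end of the group of values tied with pooled[i].
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value
```

With two fully separated toy samples, `mann_whitney_u([1, 2, 3], [4, 5, 6])` gives U = 0 and a p-value just under 0.05, so a feature with this separation would pass the p < 0.05 criterion.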


Table 1. Statistical analysis of features

Features          Mean volume in [mm] ± S.D.

Volume            Normal             AD                 p-value
V_GM              849.5 ± 62.1       776.6 ± 114.3      0.011
V_WM              621.6 ± 57.3       534.5 ± 71.9       0.014
V_CSF             849.6 ± 137.1      969.8 ± 117.8      0.038

Shape             Normal             AD                 p-value
Area              1581.1 ± 268.3     2206.4 ± 713.8     0.013
Area (PR)         614.4 ± 112.1      901.7 ± 211.6      0.004
Area (PL)         611.7 ± 118.4      907.9 ± 234.1      0.001
Area (FR)         132.8 ± 98.5       253.9 ± 176.1      0.008
Area (FL)         140.5 ± 76.9       276.4 ± 191.0      0.007
Perimeter         214.3 ± 18.9       283.8 ± 36.3       0.013
Circularity       43.9 ± 5.6         37.0 ± 3.1         0.027
Elongation        1.2 ± 0.7          1.3 ± 0.1          0.022
Rectangularity    0.5 ± 0.1          0.6 ± 0.1          0.011
d(A,G)            34.7 ± 3.1         39.8 ± 6.4         0.004
d(B,G)            35.1 ± 2.9         42.3 ± 5.8         0.022
d(C,G)            37.3 ± 2.1         42.6 ± 5.1         0.026
d(D,G)            35.1 ± 3.7         41.3 ± 4.6         0.029
d(A,C)            73.2 ± 5.1         82.4 ± 12.9        0.016
d(B,D)            69.5 ± 6.7         80.9 ± 10.4        0.003
Min thickness     25.9 ± 2.1         29.5 ± 3.7         0.011
Mean Sig.         24.5 ± 2.9         29.1 ± 2.8         0.014

In fact, some of the features may be redundant or highly correlated. Therefore,
PCA [8] was introduced to reduce the dimensions of the feature space. The principal
components that contribute 95% of the total variation in the data set were chosen
herein. More specifically, to train a volume-feature-based classifier, all the volume
features were adopted. To train a shape-feature-based classifier, only the first
five principal components, which convey a large amount of information quantified by
95% energy, were adopted. In the case of using both shape and volume features, the
first six principal components were employed. For the classification, BP-ANN was
utilized to train a classifier.
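The 95% criterion can be illustrated with a short Python sketch that picks the smallest number of leading components whose cumulative share of variance reaches the target. The function name `components_for_variance` and the eigenvalues in the usage note are hypothetical; the sketch assumes the PCA eigenvalues (component variances) are already available and does not reproduce the decomposition itself.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Return the smallest k such that the k largest eigenvalues
    account for at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)
```

With the made-up eigenvalues `[6, 3, 0.6, 0.3, 0.1]`, the cumulative shares are 0.60, 0.90 and 0.96, so three components are kept.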

Table 2 shows the accuracy, sensitivity, and specificity when using various features.
Combining shape features, volume features, and PCA gives better overall classification
performance than the other configurations. With this combination, the accuracy,
sensitivity and specificity reach 92.17%, 79.91% and 88.61%, respectively.
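For reference, the three measures can be computed from confusion-matrix counts as sketched below. The sketch assumes that probable AD is the positive class, which the text does not state explicitly, and the counts in the usage note are illustrative only.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute accuracy, sensitivity and specificity from counts.

    tp/fn: positive (assumed: probable AD) subjects classified
           correctly / incorrectly.
    tn/fp: negative (assumed: normal control) subjects classified
           correctly / incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

For example, `diagnostic_metrics(tp=8, fn=2, tn=9, fp=1)` yields an accuracy of 0.85, a sensitivity of 0.8 and a specificity of 0.9.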


Table 2. Classification results

               Volume      Shape       Volume + Shape   Volume + Shape
               features    features    features         features + PCA
Accuracy       76.03%      78.92%      88.27%           92.17%
Sensitivity    73.43%      80.47%      76.63%           79.91%
Specificity    78.69%      71.27%      87.31%           88.61%

4  Conclusions 

In this study, we present a classification framework for image-aided diagnosis of AD
using easily extractable volume and shape features. With the proposed framework,
the classification accuracy reaches 88.27% using only volumetric features and
shape features. Moreover, the accuracy rises to 92.17% when volumetric features
and shape features are used with the aid of PCA. The experimental results suggest
that combining volume features and shape features to classify AD is feasible,
owing to their low computational complexity and discriminative capability.

Abstract. In this paper, we propose a simple and effective method
for speech understanding. The method incorporates several speech
recognizers. We use two types of recognizers: a large vocabulary continuous
speech recognizer and a domain-specific speech recognizer. The multiple
recognizer is a robust and flexible method for speech understanding.
Words in different utterances are often related. For example, users
frequently input a parameter value after speaking a command name to
a system. We handle this relation with a hierarchical multiple recognizer.
We compared the proposed method with a non-hierarchical method. Our
method outperformed the non-hierarchical method.

Keywords: Multiple speech recognizer, Output selection, Hierarchical
method.


1  Introduction 

Speech understanding systems have been developed for practical use recently.
One approach to developing speech understanding systems with higher accuracy
is to construct a speech understanding method using keywords, key phrases, or
sentence templates [1,2]. However, keyword-based methods have a problem:
they misinterpret non-command utterances in a dialogue. Assume that the word
“Search” is a command for a system and a user mutters “This is a search
result that I got yesterday.” in front of a microphone of the system. In this case,
keyword-based speech understanding methods often extract the word “search” from
the muttered sentence as a command for the system. Therefore, the speech
understanding system needs to detect non-command utterances in a dialogue. Several
utterance verification methods have been proposed [3,4,5].

In addition, words in different utterances are often related. For example,
users frequently input a parameter value after speaking a command name to a
system. Lane et al. [6] have reported a hierarchical topic classification method
for speech recognition. They used this relation for the hierarchical recognizer.
The method switched the language model in the recognizer on the basis of the
current topic in a dialogue.

In this paper we use the speech understanding method proposed by [5] as
the basic approach. The speech understanding task is an image editing and
management application. We also apply a hierarchical approach to the speech
understanding method. For this task, we compare the proposed method with a
non-hierarchical method.

2  OGSS:  A  Multiple  Speech  Recognizer 

In this section, we describe the basic idea of our multiple recognizer. It is based
on a large vocabulary continuous speech recognizer (LVCSR) and some domain-specific
speech recognizers (DSSR) [5]. We call it the “One Generalist and Some
Specialists (OGSS) model”. In our system, the LVCSR is the generalist, namely
domain-independent, and the DSSRs are specialists, namely domain-dependent.
Here we use one LVCSR for non-command utterances and some DSSRs for command
utterances. By using this method, we can distinguish commands from a
chat (non-command utterances).

In this process we focus on the difference between the outputs generated by each
recognizer. If an input is a command utterance, a DSSR and the LVCSR generate
similar outputs at the phoneme level because the LVCSR is domain-independent.
On the other hand, if the input is not a command utterance, they often generate
different outputs even at the phoneme level because the DSSRs never generate
the correct result for non-command utterances. In our method, we compute
the edit distance between phoneme sequences at the utterance level and the word
level by using a DP matching algorithm.
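The DP matching step can be sketched as a standard Levenshtein distance over phoneme sequences. This is a minimal illustration, not the authors' implementation: dividing by the length of the longer sequence is only one plausible reading of normalizing by the number of phonemes, and the function names and phoneme symbols used in the usage note are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (item_a != item_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer sequence length (an assumption)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

For instance, `normalized_edit_distance(["a", "n", "d", "o"], ["a", "n", "d", "u"])` is 0.25, since one of four phonemes differs.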

The  rules  to  judge  an  utterance  are  applied  in  the  following  order: 

1. Compute the utterance-level edit distance (ED_utter) between the output of the
LVCSR and that of each DSSR. Among the outputs whose edit distance is less
than thresh_utter, we select the output of the DSSR with the minimum
ED_utter as the final output.

2. Compute the word-level edit distance (ED_word) between the output of the LVCSR
and that of each DSSR. Among the outputs whose edit distance is less than
thresh_word, we select the output of the DSSR with the minimum
ED_word as the final output. Otherwise, we select the output of the LVCSR as
the final output.

The ED_utter is the edit distance value at the utterance level. The ED_word is
the average of the edit distance values computed at the word level. These values
are normalized by the number of phonemes in the outputs. The thresh_utter and
thresh_word are threshold values for the judgment. These values are based on the
previous work [5]. In the paper, thresh_utter = 0.26 and thresh_word = 0.08. Figure
1 shows examples of this process. See [5] for more details.
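The two rules can be sketched as a small selection function. This is an illustrative reading of the rules with the thresholds quoted above (0.26 and 0.08); the `DssrResult` type and the function name are hypothetical, and the edit distances are assumed to have been computed and normalized beforehand.

```python
from typing import NamedTuple, Sequence, Tuple


class DssrResult(NamedTuple):
    text: str        # output of one DSSR
    ed_utter: float  # normalized utterance-level edit distance to the LVCSR output
    ed_word: float   # normalized average word-level edit distance to the LVCSR output


def select_output(lvcsr_text: str,
                  dssr_results: Sequence[DssrResult],
                  thresh_utter: float = 0.26,
                  thresh_word: float = 0.08) -> Tuple[str, str]:
    """Return (final output, judgment) following rule 1, then rule 2."""
    # Rule 1: utterance-level check.
    close = [r for r in dssr_results if r.ed_utter < thresh_utter]
    if close:
        return min(close, key=lambda r: r.ed_utter).text, "command"
    # Rule 2: word-level check.
    close = [r for r in dssr_results if r.ed_word < thresh_word]
    if close:
        return min(close, key=lambda r: r.ed_word).text, "command"
    # Otherwise the LVCSR output is used and the input is treated as chat.
    return lvcsr_text, "not command"
```

With made-up distances, `select_output("and I do", [DssrResult("Undo", 0.2, 0.5)])` returns `("Undo", "command")`, whereas a DSSR output far from the LVCSR output leaves the LVCSR result in place.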

3  Hierarchical  Method 

Words in different utterances often have a dependency relation. For example,
users frequently input a parameter value after speaking a command name to
a system. We treat this relation with a hierarchical multiple recognizer. In this
section, we describe a hierarchical method for the OGSS model speech recognizer.

Input: Hello (a part of a chat)
  LVCSR: Hello  <- Correct
  DSSR:  Undo   <- Out of vocabulary; it always makes a mistake
  Not similar on phoneme-level. Judge: [not command]

Input: Undo (a command for the system)
  LVCSR: and I do  <- LVCSR often makes a mistake
  DSSR:  Undo      <- High accuracy for commands
  Similar on phoneme-level. Judge: [command]

Input: Uh-oh, I blurred it too much! (a part of a chat)
  LVCSR: I blurred it too much
  DSSR:  Blur it
  Edit distance is large. Judge: [not command]
  * In this case, keyword-based methods often extract
    [blur] from the input as a command for the system

Fig. 1. Examples of the utterance verification


Figure 2 shows an example of the hierarchical method. A rectangle in the
figure denotes a speech recognizer. The system consists of several DSSRs that are
segmented by command category and a DSSR with all command vocabularies.
First, it selects the output with the minimum edit distance from the
segmentalized DSSRs. Here, we apply a threshold to the output. If the output
from the segmentalized DSSR has high confidence, we select it as
the final result of the hierarchical method. We regard the edit distance value
in the OGSS model as the confidence measure for the process. The threshold is
applied to the utterance-level edit distance. The threshold thresh_comb is
0.14 in this paper. This value is approximately half of thresh_utter.

If the edit distance of the output of the segmentalized DSSR is more than
thresh_comb, that is, its confidence is low, we also use the DSSR with all command
vocabularies. The reason is that the segmentalized DSSRs occasionally select an
incorrect output because they consist of many DSSRs. In this case, we select the
output as the final output only if the output from the segmentalized DSSR is
identical to that of the DSSR with all commands. By combining the two types of DSSRs,
we benefit from the high word recognition accuracy of the segmentalized
DSSRs and the high selection accuracy of the DSSR with all commands.
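The combination of the two DSSR types can be sketched as follows. This is a simplified reading of the procedure described above, using the quoted value thresh_comb = 0.14; the function name and the `(text, edit distance)` tuple layout are hypothetical, and returning `None` when the two recognizers disagree is an assumption, since the text does not spell out that case.

```python
from typing import List, Optional, Tuple

# (recognized text, utterance-level edit distance to the LVCSR output)
Candidate = Tuple[str, float]


def hierarchical_select(segmented: List[Candidate],
                        all_commands: Candidate,
                        thresh_comb: float = 0.14) -> Optional[str]:
    """Pick a final output from the segmentalized DSSRs and the DSSR
    that holds all command vocabularies."""
    # Best candidate among the category-specific (segmentalized) DSSRs.
    best_text, best_ed = min(segmented, key=lambda c: c[1])
    # High confidence: a small edit distance is accepted directly.
    if best_ed < thresh_comb:
        return best_text
    # Low confidence: accept only if the all-command DSSR agrees.
    if best_text == all_commands[0]:
        return best_text
    # Assumption: no output is selected when the recognizers disagree.
    return None
```

With made-up values, `hierarchical_select([("Rotate", 0.2), ("Blur", 0.4)], ("Rotate", 0.22))` returns `"Rotate"` because the low-confidence candidate is confirmed by the all-command DSSR.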

Table 1. The utterances

Commands               Chats
Big up                 Hello
Rotating the images    Thank you
50%                    Ok I'll be there now

Our method was based on the assumption that there is a relation in the input
sequence from users. For example, users frequently input a parameter value,
such as “50%”, after speaking a command name, such as “Scaling down”, to
a system. However, this assumption is not always correct. Occasionally, users
might input the parameter value before speaking the command name: e.g., “200%.
Big up the image.” In another situation, users input the command name and
the parameter value at the same time: e.g., “Rotate 45 degrees clockwise.” We
deal with these problems. For opposite input sequences we applied the following
rules to our method.

1. Our method uses all DSSRs in the first layer.

2. If the output of the first layer is words for parameter values and the
confidence is high (ED_utter < 0.14), our method waits for the next utterance.
Otherwise, it rejects the output.

3. If the next utterance is related to the output generated from the previous
utterance, our method accepts the two inputs.

For  the  mixed  order  utterances,  we  added  recognizers  that  accept  the  mixed 
grammar  patterns,  such  as  “COMMAND  with  PARAMETER”,  in  the  first  layer. 
For  the  added  recognizers,  we  applied  the  constraint  of  the  confidence  to  the 
output  selection  process.  In  other  words,  our  method  accepts  the  output  from 
the  recognizers  for  the  mixed  order  utterances  only  if  the  confidence  is  high. 


4  Experiments 


In this section, we evaluated our methods with 88 command utterances
and 60 out-of-domain utterances such as greetings. Table 1 shows examples of
commands and out-of-domain utterances, namely chats, in the experiment. The
number of trials was 5 and the number of test subjects was 6. For the
opposite patterns and mixed order utterances, we used 20 utterances such as
“Rotate 90 degrees clockwise.” For this additional test data, the number of trials
was 5 and there was 1 test subject.

We used Julius as the LVCSR and Julian as the DSSR [7]. In this experiment,
the proposed method contained three layers: one command layer with 10 DSSRs
and two parameter layers with 16 DSSRs. We evaluated two criteria as follows:


$$\mathrm{Accuracy} = \frac{\text{\# of commands recognized correctly}}{\text{\# of commands}}$$

$$\mathrm{ChatDetect} = \frac{\text{\# of chats detected correctly}}{\text{\# of chats}}$$

Table 2. The experimental result

Method              Accuracy    ChatDetect    Opposite    MixUtter
Non-Hierarchical    85.0        83.6          -           -
Hierarchical        89.0        91.1          -           -
Hierarchical+       88.6        90.3          70.7        74.0

where “commands recognized correctly” denotes that the method detected
a command utterance as “command” correctly and recognized the command
correctly. “Chats detected correctly” denotes that the method detected a
chat utterance as “chat” correctly. We ignored the word recognition accuracy
for the chat utterances because the word accuracy for the chats is outside the
scope of this paper.

Table 2 shows the experimental result. In the table, “Non-Hierarchical”
denotes a non-hierarchical method based on the OGSS model. In other words, it
consisted of two speech recognizers: one LVCSR and one DSSR with all command
and parameter vocabularies for the application. “Opposite” and “MixUtter” are
the accuracy of the opposite input sequences and the accuracy of the mixed
order utterances. “Hierarchical+” denotes a method with rules for opposite
patterns and mixed order utterances. Note that the Non-Hierarchical and
Hierarchical methods could not handle the “Opposite” and “MixUtter” inputs.
Although the Accuracy and ChatDetect rates decreased slightly, the additional
rules were effective for these input patterns. The hierarchical methods
outperformed the non-hierarchical method. This result shows the effectiveness of
the proposed methods.

The key point of our method was the thresh_comb. We examined the changes
of the Accuracy and ChatDetect rates. Figure 3 shows the experimental result. If
the thresh_comb was large, the ChatDetect rate decreased dramatically. However,
the Accuracy and ChatDetect rates were stable in the case that the thresh_comb
was small. This result shows the robustness of our method.


[Figure: Accuracy and ChatDetect rates plotted against thresh_comb, from 0.06 to 0.20]


Fig. 3. The threshold

5 Related Work

The system request discrimination method proposed by Sako et al. [4] was based
on AdaBoost. The hierarchical method by Lane et al. [6] was based on SVMs.
Isobe et al. [8] have proposed a multi-domain speech recognition system based
on the model likelihoods of the different domain-specific language models. In
general, the systems in the previous studies need to recalculate a model to select
an output or the current topic. Moreover, machine learning techniques generally
need a large amount of training data to generate a classifier with high accuracy.
Our method only changes three thresholds.

Komatani et al. [3] have reported an utterance verification method based on the
difference of acoustic likelihood values computed from two recognizers. Using the
difference of acoustic likelihood is adequate for the verification task. Combining
the method based on acoustic likelihood with our method is left as future work.


6 Conclusions

In this paper, we proposed a hierarchical approach to understanding speech inputs.
In addition, our method handled opposite input sequences and mixed order
utterances. Our method outperformed the non-hierarchical method.

In this paper, we focused only on the selective usage of the multiple speech
recognizer. Shimada et al. [9] have reported an integrative usage (an anaphora
resolution task) of the OGSS model. The context information recognized by the
LVCSR in chats is often important for deeper speech understanding. Future
work includes the acquisition of context information from input sequences and
its effective utilization.